linux.git/include/linux/mm_inline.h, branch v7.1-rc3

mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

2026-04-18T07:10:47+00:00

We must ensure the folio is deleted from or added to the correct lruvec
list.  So, add VM_WARN_ON_ONCE_FOLIO() to catch invalid users.  The
VM_BUG_ON_PAGE() in move_pages_to_lru() can be removed as
add_page_to_lru_list() will perform the necessary check.
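
As a rough userspace illustration of the idea (not the kernel code), the
sketch below models a warn-once check inside an LRU add helper.  The names
WARN_ON_ONCE_FOLIO_MODEL, lruvec_add_folio_sketch() and the two structs are
hypothetical stand-ins, and the condition checked here is an assumption
about what "correct lruvec" means.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins; not the kernel's definitions. */
struct lruvec_model { int id; };
struct folio_model {
	struct lruvec_model *lruvec;	/* lruvec this folio belongs to */
	bool on_lru;
};

static int warn_count;	/* number of warnings actually emitted */

/* Warn only the first time the condition is true at this call site. */
#define WARN_ON_ONCE_FOLIO_MODEL(cond, folio) do { \
	static bool warned; \
	if ((cond) && !warned) { \
		warned = true; \
		warn_count++; \
		fprintf(stderr, "folio %p: %s\n", (void *)(folio), #cond); \
	} \
} while (0)

/* The helper itself performs the check, so callers need no extra assert. */
static void lruvec_add_folio_sketch(struct lruvec_model *lruvec,
				    struct folio_model *folio)
{
	WARN_ON_ONCE_FOLIO_MODEL(folio->lruvec != lruvec, folio);
	folio->on_lru = true;
}
```

Because the check lives in the helper, a caller-side assertion such as the
one removed from move_pages_to_lru() becomes redundant.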

Link: https://lore.kernel.org/2c90fc006d9d730331a3caeef96f7e5dabe2036d.1772711148.git.zhengqi.arch@bytedance.com
Signed-off-by: Muchun Song 
Signed-off-by: Qi Zheng 
Acked-by: Roman Gushchin 
Acked-by: Johannes Weiner 
Acked-by: Shakeel Butt 
Cc: Allen Pais 
Cc: Axel Rasmussen 
Cc: Baoquan He 
Cc: Chengming Zhou 
Cc: Chen Ridong 
Cc: David Hildenbrand 
Cc: Hamza Mahfooz 
Cc: Harry Yoo 
Cc: Hugh Dickins 
Cc: Imran Khan 
Cc: Kamalesh Babulal 
Cc: Lance Yang 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes (Oracle) 
Cc: Michal Hocko 
Cc: Michal Koutný 
Cc: Mike Rapoport 
Cc: Muchun Song 
Cc: Nhat Pham 
Cc: Suren Baghdasaryan 
Cc: Usama Arif 
Cc: Vlastimil Babka 
Cc: Wei Xu 
Cc: Yosry Ahmed 
Cc: Yuanchu Xie 
Cc: Zi Yan 
Signed-off-by: Andrew Morton

mm: remove unused page_is_file_lru() function

2026-04-05T20:53:36+00:00

The page_is_file_lru() wrapper function is no longer used.  The kernel has
moved to folio-based APIs, and all callers should use folio_is_file_lru()
instead.

Remove the obsolete page-based wrapper function.

Link: https://lkml.kernel.org/r/20260323090305.798057-1-ye.liu@linux.dev
Signed-off-by: Ye Liu 
Acked-by: David Hildenbrand (Arm) 
Reviewed-by: Lorenzo Stoakes (Oracle) 
Signed-off-by: Andrew Morton

mm/mglru: fix cgroup OOM during MGLRU state switching

2026-04-05T20:53:33+00:00

When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim path. 
This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.

Problem Description
==================
The issue arises from a "reclaim vacuum" during the transition.

1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
   false before the pages are drained from MGLRU lists back to traditional
   LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
   and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
   yet, or the changes are not yet visible to all CPUs due to a lack
   of synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
   concludes there is no reclaimable memory, and triggers an OOM kill.

A similar race can occur during enablement, where the reclaimer sees the
new state but the MGLRU lists haven't been populated via fill_evictable()
yet.

Solution
========
Introduce a 'switching' state (`lru_switch`) to bridge the transition.
When transitioning, the system enters this intermediate state where
the reclaimer is forced to attempt both MGLRU and traditional reclaim
paths sequentially. This ensures that folios remain visible to at least
one reclaim mechanism until the transition is fully materialized across
all CPUs.
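
A minimal userspace model of this behaviour is sketched below.  It is not
the kernel implementation: the enum, struct and function names are
hypothetical, and folios are reduced to simple counters.  It only shows
why trying both paths in the intermediate state keeps folios reachable.

```c
/* Hypothetical model; names are illustrative, not the kernel's. */
enum lru_mode_model { LRU_CLASSIC, LRU_MGLRU, LRU_SWITCHING };

struct lruvec_model {
	enum lru_mode_model mode;
	unsigned long mglru_folios;	/* folios still on MGLRU lists */
	unsigned long classic_folios;	/* folios on traditional LRU lists */
};

/* Remove up to 'want' folios from a list and report how many were taken. */
static unsigned long take(unsigned long *list, unsigned long want)
{
	unsigned long got = *list < want ? *list : want;

	*list -= got;
	return got;
}

/* While switching, scan both kinds of list so no folio is invisible. */
static unsigned long shrink_lruvec_model(struct lruvec_model *lv,
					 unsigned long nr_to_reclaim)
{
	unsigned long reclaimed = 0;

	if (lv->mode == LRU_MGLRU || lv->mode == LRU_SWITCHING)
		reclaimed += take(&lv->mglru_folios, nr_to_reclaim - reclaimed);
	if (lv->mode == LRU_CLASSIC || lv->mode == LRU_SWITCHING)
		reclaimed += take(&lv->classic_folios, nr_to_reclaim - reclaimed);
	return reclaimed;
}
```

In this model a reclaimer that sees LRU_CLASSIC while folios are still on
the MGLRU lists reclaims nothing, which corresponds to the "reclaim vacuum"
described above; in LRU_SWITCHING it still finds them.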

Race & Mitigation
================
A race window exists between checking the 'draining' state and performing
the actual list operations. For instance, a reclaimer might observe the
draining state as false just before it changes, leading to a suboptimal
reclaim path decision.

However, this impact is effectively mitigated by the kernel's reclaim
retry mechanism (e.g., in do_try_to_free_pages). If a reclaimer pass fails
to find eligible folios due to a state transition race, subsequent retries
in the loop will observe the updated state and correctly direct the scan
to the appropriate LRU lists. This ensures the transient inconsistency
does not escalate into a terminal OOM kill.

This effectively reduces the race window that previously triggered OOMs
under high memory pressure.

This fix has been verified on v7.0.0-rc1; dynamic toggling of MGLRU
functions correctly without triggering unexpected OOM kills.

Link: https://lkml.kernel.org/r/20260319-b4-switch-mglru-v2-v5-1-8898491e5f17@gmail.com
Signed-off-by: Leno Hou 
Acked-by: Yafang Shao 
Reviewed-by: Barry Song 
Reviewed-by: Axel Rasmussen 
Cc: Yuanchu Xie 
Cc: Wei Xu 
Cc: Jialing Wang 
Cc: Yu Zhao 
Cc: Kairui Song 
Cc: Bingfang Guo 
Signed-off-by: Andrew Morton

mm: userfaultfd: add pgtable_supports_uffd_wp()

2025-11-24T23:08:54+00:00

Some platforms can customize the PTE/PMD entry uffd-wp bit making it
unavailable even if the architecture provides the resource.  This patch
adds a macro API pgtable_supports_uffd_wp() that allows architectures to
define their specific implementations to check if the uffd-wp bit is
available on the device the kernel is running on.

This patch also removes "ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP" and
"ifdef CONFIG_PTE_MARKER_UFFD_WP" in favor of pgtable_supports_uffd_wp()
and uffd_supports_wp_marker() checks respectively.  Unless overridden by
the architecture, these default to
IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP) and
"IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP) &&
IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP)", so no change in behavior is
expected.
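
The overridable-default pattern can be sketched in plain C as follows.
This is an illustration only: the two *_MODEL constants stand in for the
Kconfig results that the kernel obtains via IS_ENABLED(), and building the
marker check on top of pgtable_supports_uffd_wp() is an assumption about
how the two helpers relate.

```c
#include <stdbool.h>

/* Stand-ins for Kconfig results; the kernel would use IS_ENABLED(). */
#define HAVE_ARCH_USERFAULTFD_WP_MODEL	1
#define PTE_MARKER_UFFD_WP_MODEL	1

/*
 * An architecture header included earlier could define this macro itself,
 * for example to probe the running hardware; otherwise the compile-time
 * default below is used.
 */
#ifndef pgtable_supports_uffd_wp
#define pgtable_supports_uffd_wp()	(HAVE_ARCH_USERFAULTFD_WP_MODEL != 0)
#endif

/* Marker support additionally needs the marker option to be enabled. */
static inline bool uffd_supports_wp_marker(void)
{
	return pgtable_supports_uffd_wp() && PTE_MARKER_UFFD_WP_MODEL != 0;
}
```

With both stand-in constants set to 1 and no override, both checks evaluate
to true, matching the previous ifdef-based behavior.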

Link: https://lkml.kernel.org/r/20251113072806.795029-3-zhangchunyan@iscas.ac.cn
Signed-off-by: Chunyan Zhang 
Acked-by: David Hildenbrand 
Cc: Albert Ou 
Cc: Alexandre Ghiti 
Cc: Al Viro 
Cc: Andrew Jones 
Cc: Arnd Bergmann 
Cc: Axel Rasmussen 
Cc: Christian Brauner 
Cc: Conor Dooley 
Cc: Deepak Gupta 
Cc: Jan Kara 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Palmer Dabbelt 
Cc: Paul Walmsley 
Cc: Peter Xu 
Cc: Rob Herring 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Cc: Yuanchu Xie 
Signed-off-by: Andrew Morton

memcg: remove __mod_lruvec_state

2025-11-24T23:08:54+00:00

__mod_lruvec_state() is already safe against irqs, so there is no need to
have a separate interface (i.e.  mod_lruvec_state) which wraps calls to it
with irq disabling and reenabling.  Let's rename __mod_lruvec_state() to
mod_lruvec_state().

Link: https://lkml.kernel.org/r/20251110232008.1352063-4-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt 
Reviewed-by: Harry Yoo 
Acked-by: Roman Gushchin 
Acked-by: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Muchun Song 
Cc: Qi Zheng 
Signed-off-by: Andrew Morton

mm: introduce leaf entry type and use to simplify leaf entry logic

2025-11-24T23:08:50+00:00

The kernel maintains leaf page table entries which contain either:

 - Nothing ('none' entries)
 - Present entries*
 - Everything else that will cause a fault which the kernel handles

* Present entries are either entries the hardware can navigate without page
  fault or special cases like NUMA hint protnone or PMD with cleared
  present bit which contain hardware-valid entries modulo the present bit.

In the 'everything else' group we include swap entries, but we also
include a number of other things such as migration entries, device private
entries and marker entries.

Unfortunately this 'everything else' group expresses everything through a
swp_entry_t type, and these entries are referred to as swap entries even
though they may well not contain a...  swap entry.

This is compounded by the rather mind-boggling concept of a non-swap swap
entry (checked via non_swap_entry()) and the means by which we twist and
turn to satisfy this.

This patch lays the foundation for reducing this confusion.

We refer to 'everything else' as a 'software-defined leaf entry' or
'softleaf' for short.  And in fact we scoop up the 'none' entries into
this concept also so we are left with:

- Present entries.
- Softleaf entries (which may be empty).

This allows for radical simplification across the board - one can simply
convert any leaf page table entry to a leaf entry via softleaf_from_pte().

If the entry is present, we return an empty leaf entry, so it is assumed
the caller is aware that they must differentiate between the two
categories of page table entries, checking for the former via
pte_present().
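
The conversion rule can be illustrated with a toy model.  Everything below
is hypothetical: the *_model types, the single "present" bit and the helper
names are invented for the sketch and do not reflect any architecture's
real PTE layout or the actual leafops.h API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy encoding, purely illustrative: bit 0 marks a present entry. */
typedef struct { uint64_t val; } pte_model_t;
typedef struct { uint64_t val; } softleaf_model_t;

#define PTE_MODEL_PRESENT 0x1ULL

static bool pte_model_present(pte_model_t pte)
{
	return (pte.val & PTE_MODEL_PRESENT) != 0;
}

/*
 * Unconditional conversion: present entries map to the empty softleaf,
 * everything else (including 'none') is carried over as-is.
 */
static softleaf_model_t softleaf_model_from_pte(pte_model_t pte)
{
	softleaf_model_t entry = { 0 };

	if (!pte_model_present(pte))
		entry.val = pte.val;
	return entry;
}

static bool softleaf_model_is_none(softleaf_model_t entry)
{
	return entry.val == 0;
}
```

A caller that needs to tell a present entry from a 'none' entry still has
to test for presence first, since both convert to the empty softleaf here.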

As a result, we can eliminate a number of places where we would otherwise
need to use predicates to see if we can proceed with leaf page table entry
conversion and instead just go ahead and do it unconditionally.

We do so where we can, adjusting surrounding logic as necessary to
integrate the new softleaf_t logic as far as seems reasonable at this
stage.

We typedef swp_entry_t to softleaf_t for the time being until the
conversion can be complete, meaning everything remains compatible
regardless of which type is used.  We will eventually remove swp_entry_t
when the conversion is complete.

We introduce a new header file to keep things clear - leafops.h - which
includes swapops.h, so it can directly replace swapops.h includes without
issue, and we do so in all the files that require it.

Additionally, add new leafops.h file to core mm maintainers entry.

Link: https://lkml.kernel.org/r/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes 
Acked-by: Zi Yan 
Reviewed-by: Vlastimil Babka 
Cc: Alexander Gordeev 
Cc: Alistair Popple 
Cc: Al Viro 
Cc: Arnd Bergmann 
Cc: Axel Rasmussen 
Cc: Baolin Wang 
Cc: Baoquan He 
Cc: Barry Song 
Cc: Byungchul Park 
Cc: Chengming Zhou 
Cc: Chris Li 
Cc: Christian Borntraeger 
Cc: Christian Brauner 
Cc: Claudio Imbrenda 
Cc: David Hildenbrand 
Cc: Dev Jain 
Cc: Gerald Schaefer 
Cc: Gregory Price 
Cc: Heiko Carstens 
Cc: "Huang, Ying" 
Cc: Hugh Dickins 
Cc: Jan Kara 
Cc: Jann Horn 
Cc: Janosch Frank 
Cc: Jason Gunthorpe 
Cc: Joshua Hahn 
Cc: Kairui Song 
Cc: Kemeng Shi 
Cc: Lance Yang 
Cc: Leon Romanovsky 
Cc: Liam Howlett 
Cc: Matthew Brost 
Cc: Matthew Wilcox (Oracle) 
Cc: Miaohe Lin 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Nhat Pham 
Cc: Nico Pache 
Cc: Oscar Salvador 
Cc: Pasha Tatashin 
Cc: Peter Xu 
Cc: Rakie Kim 
Cc: Rik van Riel 
Cc: Ryan Roberts 
Cc: SeongJae Park 
Cc: Suren Baghdasaryan 
Cc: Sven Schnelle 
Cc: Vasily Gorbik 
Cc: Wei Xu 
Cc: xu xin 
Cc: Yuanchu Xie 
Signed-off-by: Andrew Morton

mm: introduce num_pages_contiguous()

2025-10-06T17:21:26+00:00

Let's add a simple helper for determining the number of contiguous pages
that represent contiguous PFNs.

In an ideal world, this helper would be simpler or not even required.
Unfortunately, on some configs we still have to maintain (SPARSEMEM
without VMEMMAP), the memmap is allocated per memory section, and we might
run into weird corner cases of false positives when blindly testing for
contiguous pages only.

One example of such false positives would be a memory section-sized hole
that does not have a memmap. The surrounding memory sections might get
"struct pages" that are contiguous, but the PFNs are actually not.

This helper will, for example, be useful for determining contiguous PFNs
in a GUP result, to batch further operations across returned "struct
page"s. VFIO will utilize this interface to accelerate the VFIO DMA map
process.
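
The false positive can be reproduced in a small userspace model.  This is
a sketch under simplifying assumptions, not the kernel helper: each toy
page stores its own PFN, and the function name and struct are hypothetical.
The real helper has to derive the same information from the memory model
rather than from a stored field.

```c
#include <stddef.h>

/* Toy page descriptor: it simply records the PFN it describes. */
struct page_model { unsigned long pfn; };

/*
 * Count how many leading entries of pages[] are contiguous both as
 * descriptors and as PFNs.  Descriptor adjacency alone is not enough.
 */
static size_t num_pages_contiguous_model(struct page_model *const *pages,
					 size_t nr_pages)
{
	size_t i;

	if (nr_pages == 0)
		return 0;

	for (i = 1; i < nr_pages; i++) {
		/* The descriptors must be neighbours in memory ... */
		if (pages[i] != pages[i - 1] + 1)
			break;
		/* ... and the PFNs they describe must be consecutive too. */
		if (pages[i]->pfn != pages[i - 1]->pfn + 1)
			break;
	}
	return i;
}
```

If eight descriptors sit back to back in memory but the last four describe
PFNs after a hole, the model reports four contiguous pages rather than
eight.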

Implementation based on Linus' suggestions to avoid new usage of
nth_page() where avoidable.

Suggested-by: Linus Torvalds 
Suggested-by: Jason Gunthorpe 
Signed-off-by: Li Zhe 
Co-developed-by: David Hildenbrand 
Signed-off-by: David Hildenbrand 
Link: https://lore.kernel.org/r/20250814064714.56485-2-lizhe.67@bytedance.com
Signed-off-by: Alex Williamson

mm: constify various inline functions for improved const-correctness

2025-09-21T21:22:15+00:00

We select certain test functions plus folio_migrate_refs() from
mm_inline.h which either invoke each other, functions that are already
const-ified, or no further functions.

It is therefore relatively trivial to const-ify them, which provides a
basis for const-ification further up the call stack.

One exception is the function folio_migrate_refs() which does write to the
"new" folio pointer; there, only the "old" folio pointer is being
constified; only its "flags" field is read, but nothing written.
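
The two cases look roughly like this in a simplified model.  The struct,
flag bits and function names below are hypothetical stand-ins, not the
definitions in mm_inline.h.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the folio and two of its flag bits. */
struct folio_model { unsigned long flags; };

#define FM_REFERENCED	(1UL << 0)
#define FM_SWAPBACKED	(1UL << 1)

/* A pure test function: it only reads, so the pointer can be const. */
static inline bool folio_model_is_file_lru(const struct folio_model *folio)
{
	return !(folio->flags & FM_SWAPBACKED);
}

/*
 * The exception: the new folio is written to, so only the old folio,
 * whose flags are merely read, takes a const pointer.
 */
static inline void folio_model_migrate_refs(struct folio_model *new_folio,
					    const struct folio_model *old_folio)
{
	if (old_folio->flags & FM_REFERENCED)
		new_folio->flags |= FM_REFERENCED;
}
```

With these signatures a caller holding only a const pointer can use the
test function directly, which is what allows const to propagate upward.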

Link: https://lkml.kernel.org/r/20250901205021.3573313-11-max.kellermann@ionos.com
Signed-off-by: Max Kellermann 
Reviewed-by: Vishal Moola (Oracle) 
Reviewed-by: Lorenzo Stoakes 
Acked-by: David Hildenbrand 
Acked-by: Vlastimil Babka 
Acked-by: Mike Rapoport (Microsoft) 
Acked-by: Shakeel Butt 
Cc: Alexander Gordeev 
Cc: Al Viro 
Cc: Andreas Larsson 
Cc: Andy Lutomirski 
Cc: Axel Rasmussen 
Cc: Baolin Wang 
Cc: Borislav Petkov 
Cc: Christian Borntraeger 
Cc: Christian Brauner 
Cc: Christian Zankel 
Cc: David Rientjes 
Cc: David S. Miller 
Cc: Gerald Schaefer 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: "H. Peter Anvin" 
Cc: Hugh Dickins 
Cc: Ingo Molnar 
Cc: James Bottomley 
Cc: Jan Kara 
Cc: Jocelyn Falempe 
Cc: Liam Howlett 
Cc: Mark Brown 
Cc: Matthew Wilcox (Oracle) 
Cc: Max Filippov 
Cc: Michael Ellerman 
Cc: Michal Hocko 
Cc: "Nysal Jan K.A" 
Cc: Oscar Salvador 
Cc: Peter Zijlstra 
Cc: Russell King 
Cc: Suren Baghdasaryan 
Cc: Sven Schnelle 
Cc: Thomas Gleixner 
Cc: Thomas Huth 
Cc: Vasily Gorbik 
Cc: Wei Xu 
Cc: Yuanchu Xie 
Signed-off-by: Andrew Morton

mm: introduce memdesc_flags_t

2025-09-13T23:55:07+00:00

Patch series "Add and use memdesc_flags_t".

At some point struct page will be separated from struct slab and struct
folio.  This is a step towards that by introducing a type for the 'flags'
word of all three structures.  This gives us a certain amount of type
safety by establishing that some of these unsigned longs are different
from other unsigned longs in that they contain things like node ID,
section number and zone number in the upper bits.  That lets us have
functions that can be easily called by anyone who has a slab, folio or
page (but not easily by anyone else) to get the node or zone.

There are going to be some unusual merge problems with this as some odd bits
of the kernel decide they want to print out the flags value or something
similar by writing page->flags and now they'll need to write page->flags.f
instead.  That's most of the churn here.  Maybe we should be removing
these things from the debug output?


This patch (of 11):

Wrap the unsigned long flags in a typedef.  In upcoming patches, this will
provide a strong hint that you can't just pass a random unsigned long to
functions which take this as an argument.
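
A sketch of the wrapping technique is shown below.  The member name 'f'
follows the page->flags.f spelling mentioned above; the *_model names and
the choice of four top bits for the node ID are assumptions made only for
this example.

```c
/* Wrapping the word in a struct makes it a distinct type. */
typedef struct { unsigned long f; } memdesc_flags_model_t;

/* Illustrative layout: the top four bits of the word hold the node ID. */
#define NODE_SHIFT_MODEL (sizeof(unsigned long) * 8 - 4)

/*
 * Takes the wrapped type, so passing a bare unsigned long is a compile
 * error rather than a silent mistake.
 */
static inline int memdesc_model_nid(memdesc_flags_model_t mdf)
{
	return (int)(mdf.f >> NODE_SHIFT_MODEL);
}
```

Code that wants the raw word, for example to print it, has to spell out
the .f member explicitly, which is the churn the cover letter refers to.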

[willy@infradead.org: s/flags/flags.f/ in several architectures]
  Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org
[nicola.vetrini@gmail.com: mips: fix compilation error]
  Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/
  Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com
Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Acked-by: Zi Yan 
Cc: Shakeel Butt 
Signed-off-by: Andrew Morton

mm: convert VM_PFNMAP tracking to pfnmap_track() + pfnmap_untrack()

2025-05-22T21:55:37+00:00

Let's use our new interface.  In remap_pfn_range(), we'll now decide
whether we have to track (full VMA covered) or only lookup the cachemode
(partial VMA covered).

Remember what we have to untrack by linking it from the VMA.  When
duplicating VMAs (e.g., splitting, mremap, fork), we'll handle it similarly
to anon VMA names, and use a kref to share the tracking.

Once the last VMA un-refs our tracking data, we'll do the untracking,
which simplifies things a lot and should sort out various issues we saw
recently, for example, when partially unmapping/zapping a tracked VMA.
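
The sharing scheme can be modelled in userspace as follows.  This is a
sketch only: a plain counter stands in for the kref, and the struct and
function names are hypothetical rather than the kernel's.

```c
#include <stdlib.h>

/* Hypothetical tracking record shared between duplicated VMAs. */
struct pfnmap_track_model {
	unsigned int refcount;	/* plain counter standing in for a kref */
	unsigned long pfn;
	unsigned long size;
};

struct vma_model { struct pfnmap_track_model *track; };

static int untrack_calls;	/* how often the range was untracked */

/* Start tracking a PFN range; the creator holds the first reference. */
static struct pfnmap_track_model *track_new(unsigned long pfn,
					    unsigned long size)
{
	struct pfnmap_track_model *t = malloc(sizeof(*t));

	if (t) {
		t->refcount = 1;
		t->pfn = pfn;
		t->size = size;
	}
	return t;
}

/* Duplicating a VMA (split, mremap, fork) shares the tracking record. */
static void vma_model_dup(struct vma_model *dst, const struct vma_model *src)
{
	dst->track = src->track;
	if (dst->track)
		dst->track->refcount++;
}

/* Only the last VMA dropping its reference untracks the range. */
static void vma_model_close(struct vma_model *vma)
{
	struct pfnmap_track_model *t = vma->track;

	vma->track = NULL;
	if (t && --t->refcount == 0) {
		untrack_calls++;
		free(t);
	}
}
```

In this model the original range stays tracked until every VMA derived
from the mapping has been closed, regardless of the order in which they go
away.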

This change implies that we'll keep tracking the original PFN range even
after splitting + partially unmapping it: not too bad, because it was not
working reliably before.  The only thing that kind-of worked before was
shrinking such a mapping using mremap(): we managed to adjust the
reservation in a hacky way, now we won't adjust the reservation but leave
it around until all involved VMAs are gone.

If that ever turns out to be an issue, we could hook into VMA splitting
code and split the tracking; however, that adds complexity that might not
be required, so we'll keep it simple for now.

Link: https://lkml.kernel.org/r/20250512123424.637989-5-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Ingo Molnar 	[x86 bits]
Reviewed-by: Lorenzo Stoakes 
Reviewed-by: Liam R. Howlett 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Dave Airlie 
Cc: "H. Peter Anvin" 
Cc: Jani Nikula 
Cc: Jann Horn 
Cc: Jonas Lahtinen 
Cc: "Masami Hiramatsu (Google)" 
Cc: Mathieu Desnoyers 
Cc: Peter Xu 
Cc: Peter Zijlstra 
Cc: Rodrigo Vivi 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Cc: Tvrtko Ursulin 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton