linux.git/mm/rmap.c, branch v6.7

mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()

2023-10-18T21:34:14+00:00

Let's convert it to consume a folio.

[akpm@linux-foundation.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/20231002142949.235104-3-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Suren Baghdasaryan 
Reviewed-by: Vishal Moola (Oracle) 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Matthew Wilcox 
Signed-off-by: Andrew Morton

mm/rmap: move SetPageAnonExclusive() out of page_move_anon_rmap()

2023-10-18T21:34:14+00:00

Patch series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()".

Convert page_move_anon_rmap() to folio_move_anon_rmap(), letting the
callers handle PageAnonExclusive.  I'm including cleanup patch #3 because
it fits into the picture and can be done cleaner by the conversion.


This patch (of 3):

Let's move it into the caller: there is a difference between whether an
anon folio can only be mapped by one process (e.g., into one VMA), and
whether it is truly exclusive (e.g., no references -- including GUP --
from other processes).

Further, for large folios the page might not actually be pointing at the
head page of the folio, so it better be handled in the caller.  This is a
preparation for converting page_move_anon_rmap() to consume a folio.

Link: https://lkml.kernel.org/r/20231002142949.235104-1-david@redhat.com
Link: https://lkml.kernel.org/r/20231002142949.235104-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Suren Baghdasaryan 
Reviewed-by: Vishal Moola (Oracle) 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Matthew Wilcox 
Signed-off-by: Andrew Morton

mm: handle large folio when large folio in VM_LOCKED VMA range

2023-10-04T17:32:32+00:00

If large folio is in the range of VM_LOCKED VMA, it should be mlocked to
avoid being picked by page reclaim.  Which may split the large folio and
then mlock each pages again.

Mlock this kind of large folio to prevent them being picked by page
reclaim.

For the large folio which cross the boundary of VM_LOCKED VMA or not fully
mapped to VM_LOCKED VMA, we'd better not to mlock it.  So if the system is
under memory pressure, this kind of large folio will be split and the
pages ouf of VM_LOCKED VMA can be reclaimed.

Ideally, for large folio, we should mlock it when the large folio is fully
mapped to VMA and munlock it if any page are unmampped from VMA.  But it's
not easy to detect whether the large folio is fully mapped to VMA in some
cases (like add/remove rmap).  So we update mlock_vma_folio() and
munlock_vma_folio() to mlock/munlock the folio according to vma->vm_flags.
Let caller to decide whether they should call these two functions.

For add rmap, only mlock normal 4K folio and postpone large folio handling
to page reclaim phase.  It is possible to reuse page table iterator to
detect whether folio is fully mapped or not during page reclaim phase. 
For remove rmap, invoke munlock_vma_folio() to munlock folio unconditionly
because rmap makes folio not fully mapped to VMA.

Link: https://lkml.kernel.org/r/20230918073318.1181104-3-fengwei.yin@intel.com
Signed-off-by: Yin Fengwei 
Cc: David Hildenbrand 
Cc: Hugh Dickins 
Cc: Matthew Wilcox (Oracle) 
Cc: Ryan Roberts 
Cc: Yang Shi 
Cc: Yosry Ahmed 
Cc: Yu Zhao 
Signed-off-by: Andrew Morton

mm/rmap: pass folio to hugepage_add_anon_rmap()

2023-10-04T17:32:27+00:00

Let's pass a folio; we are always mapping the entire thing.

Link: https://lkml.kernel.org/r/20230913125113.313322-7-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Mike Kravetz 
Cc: Muchun Song 
Signed-off-by: Andrew Morton

mm/rmap: simplify PageAnonExclusive sanity checks when adding anon rmap

2023-10-04T17:32:27+00:00

Let's sanity-check PageAnonExclusive vs.  mapcount in page_add_anon_rmap()
and hugepage_add_anon_rmap() after setting PageAnonExclusive simply by
re-reading the mapcounts.

We can stop initializing the "first" variable in page_add_anon_rmap() and
no longer need an atomic_inc_and_test() in hugepage_add_anon_rmap().

While at it, switch to VM_WARN_ON_FOLIO().

[david@redhat.com: update check for doubly-mapped page]
  Link: https://lkml.kernel.org/r/d8e5a093-2e22-c14b-7e64-6da280398d9f@redhat.com
Link: https://lkml.kernel.org/r/20230913125113.313322-6-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Mike Kravetz 
Cc: Muchun Song 
Signed-off-by: Andrew Morton

mm/rmap: warn on new PTE-mapped folios in page_add_anon_rmap()

2023-10-04T17:32:27+00:00

If swapin code would ever decide to not use order-0 pages and supply a
PTE-mapped large folio, we will have to change how we call
__folio_set_anon() -- eventually with exclusive=false and an adjusted
address.  For now, let's add a VM_WARN_ON_FOLIO() with a comment about the
situation.

Link: https://lkml.kernel.org/r/20230913125113.313322-5-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Mike Kravetz 
Cc: Muchun Song 
Signed-off-by: Andrew Morton

mm/rmap: move folio_test_anon() check out of __folio_set_anon()

2023-10-04T17:32:27+00:00

Let's handle it in the caller; no need for the "first" check based on the
mapcount.

We really only end up with !anon pages in page_add_anon_rmap() via
do_swap_page(), where we hold the folio lock.  So races are not possible. 
Add a VM_WARN_ON_FOLIO() to make sure that we really hold the folio lock.

In the future, we might want to let do_swap_page() use
folio_add_new_anon_rmap() on new pages instead: however, we might have to
pass then whether the folio is exclusive or not.  So keep it in there for
now.

For hugetlb we never expect to have a non-anon page in
hugepage_add_anon_rmap().  Remove that code, along with some other checks
that are either not required or were checked in
hugepage_add_new_anon_rmap() already.

Link: https://lkml.kernel.org/r/20230913125113.313322-4-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Mike Kravetz 
Cc: Muchun Song 
Signed-off-by: Andrew Morton

mm/rmap: move SetPageAnonExclusive out of __page_set_anon_rmap()

2023-10-04T17:32:27+00:00

Let's handle it in the caller.  No need to pass the page.  While at it,
rename the function to __folio_set_anon() and pass "bool exclusive"
instead of "int exclusive".

Link: https://lkml.kernel.org/r/20230913125113.313322-3-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Mike Kravetz 
Cc: Muchun Song 
Signed-off-by: Andrew Morton

mm/rmap: drop stale comment in page_add_anon_rmap and hugepage_add_anon_rmap()

2023-10-04T17:32:27+00:00

Patch series "Anon rmap cleanups".

Some cleanups around rmap for anon pages.  I'm working on more cleanups
also around file rmap -- also to handle the "compound" parameter
internally only and to let hugetlb use page_add_file_rmap(), but these
changes make sense separately.


This patch (of 6):

That comment was added in commit 5dbe0af47f8a ("mm: fix kernel BUG at
mm/rmap.c:1017!") to document why we can see vma->vm_end getting adjusted
concurrently due to a VMA split.

However, the optimized locking code was changed again in bf181b9f9d8 ("mm
anon rmap: replace same_anon_vma linked list with an interval tree.").

...  and later, the comment was changed in commit 0503ea8f5ba7 ("mm/mmap:
remove __vma_adjust()") to talk about "vma_merge" although the original
issue was with VMA splitting.

Let's just remove that comment.  Nowadays, it's outdated, imprecise and
confusing.

Link: https://lkml.kernel.org/r/20230913125113.313322-1-david@redhat.com
Link: https://lkml.kernel.org/r/20230913125113.313322-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Matthew Wilcox 
Signed-off-by: Andrew Morton

mm: hugetlb: add huge page size param to set_huge_pte_at()

2023-09-30T00:20:47+00:00

Patch series "Fix set_huge_pte_at() panic on arm64", v2.

This series fixes a bug in arm64's implementation of set_huge_pte_at(),
which can result in an unprivileged user causing a kernel panic.  The
problem was triggered when running the new uffd poison mm selftest for
HUGETLB memory.  This test (and the uffd poison feature) was merged for
v6.5-rc7.

Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable
(correctly this time) to get it backported to v6.5, where the issue first
showed up.


Description of Bug
==================

arm64's huge pte implementation supports multiple huge page sizes, some of
which are implemented in the page table with multiple contiguous entries. 
So set_huge_pte_at() needs to work out how big the logical pte is, so that
it can also work out how many physical ptes (or pmds) need to be written. 
It previously did this by grabbing the folio out of the pte and querying
its size.

However, there are cases when the pte being set is actually a swap entry. 
But this also used to work fine, because for huge ptes, we only ever saw
migration entries and hwpoison entries.  And both of these types of swap
entries have a PFN embedded, so the code would grab that and everything
still worked out.

But over time, more calls to set_huge_pte_at() have been added that set
swap entry types that do not embed a PFN.  And this causes the code to go
bang.  The triggering case is for the uffd poison test, commit
99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which
causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit
8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") -
added in v6.5-rc7.  Although review shows that there are other call sites
that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger
on arm64 because arm64 doesn't support UFFD WP.

If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise,
it will dereference a bad pointer in page_folio():

    static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
    {
        VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));

        return page_folio(pfn_to_page(swp_offset_pfn(entry)));
    }


Fix
===

The simplest fix would have been to revert the dodgy cleanup commit
18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since
things have moved on, this would have required an audit of all the new
set_huge_pte_at() call sites to see if they should be converted to
set_huge_swap_pte_at().  As per the original intent of the change, it
would also leave us open to future bugs when people invariably get it
wrong and call the wrong helper.

So instead, I've added a huge page size parameter to set_huge_pte_at(). 
This means that the arm64 code has the size in all cases.  It's a bigger
change, due to needing to touch the arches that implement the function,
but it is entirely mechanical, so in my view, low risk.

I've compile-tested all touched arches; arm64, parisc, powerpc, riscv,
s390, sparc (and additionally x86_64).  I've additionally booted and run
mm selftests against arm64, where I observe the uffd poison test is fixed,
and there are no other regressions.


This patch (of 2):

In order to fix a bug, arm64 needs to be told the size of the huge page
for which the pte is being set in set_huge_pte_at().  Provide for this by
adding an `unsigned long sz` parameter to the function.  This follows the
same pattern as huge_pte_clear().

This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, parisc, powerpc,
riscv, s390, sparc).  The actual arm64 bug will be fixed in a separate
commit.

No behavioral changes intended.

Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com
Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs")
Signed-off-by: Ryan Roberts 
Reviewed-by: Christophe Leroy 	[powerpc 8xx]
Reviewed-by: Lorenzo Stoakes 	[vmalloc change]
Cc: Alexandre Ghiti 
Cc: Albert Ou 
Cc: Alexander Gordeev 
Cc: Anshuman Khandual 
Cc: Arnd Bergmann 
Cc: Axel Rasmussen 
Cc: Catalin Marinas 
Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: David S. Miller 
Cc: Gerald Schaefer 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: "James E.J. Bottomley" 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Nicholas Piggin 
Cc: Palmer Dabbelt 
Cc: Paul Walmsley 
Cc: Peter Xu 
Cc: Qi Zheng 
Cc: Ryan Roberts 
Cc: SeongJae Park 
Cc: Sven Schnelle 
Cc: Uladzislau Rezki (Sony) 
Cc: Vasily Gorbik 
Cc: Will Deacon 
Cc: 	[6.5+]
Signed-off-by: Andrew Morton