linux-stable.git/fs/proc/task_mmu.c, branch linux-6.2.y

mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps

2023-02-01T00:44:09+00:00

Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs".

This issue of mapcount in hugetlb pages referenced by shared PMDs was
discussed in [1].  The following two patches address user visible behavior
caused by this issue.

[1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/


This patch (of 2):

A hugetlb page will have a mapcount of 1 if mapped by multiple processes
via a shared PMD.  This is because only the first process increases the
map count, and subsequent processes just add the shared PMD page to their
page table.

page_mapcount is being used to decide if a hugetlb page is shared or
private in /proc/PID/smaps.  Pages referenced via a shared PMD were
incorrectly being counted as private.

To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
count the hugetlb page as shared.  A new helper to check for a shared PMD
is added.
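
The decision described above can be modeled in userspace as a minimal
sketch.  The struct and function names here are hypothetical, and treating
a PMD page with a reference count above 1 as "shared" is an assumption of
this sketch, not a copy of the kernel helper:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's data structures. */
struct hugetlb_map_model {
	int page_mapcount;      /* mapcount of the hugetlb page */
	int pmd_page_refcount;  /* refcount of the page holding the PMD */
};

/* Assumption: a PMD page referenced more than once is shared. */
static bool model_pmd_shared(const struct hugetlb_map_model *m)
{
	return m->pmd_page_refcount > 1;
}

/* Count the page as shared if mapcount >= 2 or the PMD is shared. */
static bool model_counts_as_shared(const struct hugetlb_map_model *m)
{
	assert(m->page_mapcount >= 1);
	return m->page_mapcount >= 2 || model_pmd_shared(m);
}
```

With this model, a page with mapcount 1 reached through a shared PMD is
counted as shared rather than private.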

[akpm@linux-foundation.org: simplification, per David]
[akpm@linux-foundation.org: hugetlb.h: include page_ref.h for page_count()]
Link: https://lkml.kernel.org/r/20230126222721.222195-2-mike.kravetz@oracle.com
Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
Signed-off-by: Mike Kravetz 
Acked-by: Peter Xu 
Cc: David Hildenbrand 
Cc: James Houghton 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Vishal Moola (Oracle) 
Cc: Yang Shi 
Signed-off-by: Andrew Morton

mm: do not show fs mm pc for VM_LOCKONFAULT pages

2022-12-12T02:12:21+00:00

When VM_LOCKONFAULT was added, /proc/PID/smaps wasn't hooked up to it, so
/proc/PID/smaps shows '??' for such mappings instead of something
intelligible.  Userspace can trigger this simply by calling
`mlock2(..., MLOCK_ONFAULT);`.

Fix this by adding "lf" to denote VM_LOCKONFAULT.
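
As a rough illustration of why a missing table entry shows up as '??', here
is a minimal userspace sketch of a two-letter mnemonic lookup.  The names
and bit positions are assumptions made for this example, not the kernel's
definitions:

```c
#include <assert.h>

#define MODEL_NBITS 32

/* Illustrative bit positions; assumptions for this sketch only. */
#define MODEL_BIT_LOCKED      13
#define MODEL_BIT_LOCKONFAULT 19

/* Return the two-letter mnemonic for a flag bit, or "??" if none is set. */
static const char *model_mnemonic(unsigned int bit)
{
	static const char *const table[MODEL_NBITS] = {
		[MODEL_BIT_LOCKED]      = "lo",
		[MODEL_BIT_LOCKONFAULT] = "lf",
	};

	assert(bit < MODEL_NBITS);
	return table[bit] ? table[bit] : "??";
}
```

Any flag bit without an entry in the table falls back to "??", which is the
symptom described above.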

Link: https://lkml.kernel.org/r/20221205173007.580210-1-Jason@zx2c4.com
Fixes: de60f5f10c58 ("mm: introduce VM_LOCKONFAULT")
Signed-off-by: Jason A. Donenfeld 
Acked-by: Vlastimil Babka 
Cc: Eric B Munson 
Cc: Kirill A. Shutemov 
Signed-off-by: Andrew Morton

mm: anonymous shared memory naming

2022-11-30T23:58:55+00:00

Since commit 9a10064f5625 ("mm: add a field to store names for private
anonymous memory"), a name can be set for private anonymous memory but
not for shared anonymous memory.  However, naming shared anonymous memory
is just as useful for tracking purposes.

Extend the functionality so that names can also be set for shared
anonymous memory.

There are two ways to create anonymous shared memory, using memfd or
directly via mmap():
1. fd = memfd_create(...)
   mem = mmap(..., MAP_SHARED, fd, ...)
2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)

In both cases the anonymous shared memory is created the same way, by
mapping an unlinked file on tmpfs.

The memfd way allows naming the anonymous shared memory, but it is not
useful when parts of the shared memory need distinct names.

Example use case: The VMM maps VM memory as anonymous shared memory (not
private because VMM is sandboxed and drivers are running in their own
processes).  However, the VM reports back to the VMM how parts of the memory
are actually used by the guest, how each of the segments should be backed
(e.g. 4K pages, 2M pages), and some other information about the segments.
The naming allows us to monitor the effective memory footprint for each
of these segments from the host without looking inside the guest.

Sample usage:
  /* Create shared anonymous segment */
  anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  /* Name the segment: "MY-NAME" */
  rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
             anon_shmem, SIZE, "MY-NAME");

cat /proc//maps (and smaps):
7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]

If the segment is not named, the output is:
7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)

Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin 
Acked-by: David Hildenbrand 
Cc: Arnd Bergmann 
Cc: Bagas Sanjaya 
Cc: Colin Cross 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Jonathan Corbet 
Cc: "Kirill A . Shutemov" 
Cc: Liam Howlett 
Cc: Matthew Wilcox 
Cc: Mike Rapoport 
Cc: Paul Gortmaker 
Cc: Peter Xu 
Cc: Sean Christopherson 
Cc: Vincent Whitchurch 
Cc: Vlastimil Babka 
Cc: xu xin 
Cc: Yang Shi 
Cc: Yu Zhao 
Signed-off-by: Andrew Morton

mm: /proc/pid/smaps_rollup: fix maple tree search

2022-10-21T04:27:23+00:00

/proc/pid/smaps_rollup showed 0 kB for everything: now find first vma.

Link: https://lkml.kernel.org/r/3011bee7-182-97a2-1083-d5f5b688e54b@google.com
Fixes: c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end")
Signed-off-by: Hugh Dickins 
Reviewed-by: Liam R. Howlett 
Cc: Alexey Dobriyan 
Cc: Matthew Wilcox (Oracle) 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

fs/proc/task_mmu: stop using linked list and highest_vm_end

2022-09-27T02:46:21+00:00

Remove references to the mm_struct linked list and highest_vm_end in
preparation for their removal.

Link: https://lkml.kernel.org/r/20220906194824.2110408-44-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Liam R. Howlett 
Tested-by: Yu Zhao 
Cc: Catalin Marinas 
Cc: David Hildenbrand 
Cc: David Howells 
Cc: Davidlohr Bueso 
Cc: SeongJae Park 
Cc: Sven Schnelle 
Cc: Vlastimil Babka 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

mm: remove vmacache

2022-09-27T02:46:18+00:00

By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code.  Remove the vmacache
to reduce the work in keeping it up to date and code complexity.

Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett 
Acked-by: Vlastimil Babka 
Tested-by: Yu Zhao 
Cc: Catalin Marinas 
Cc: David Hildenbrand 
Cc: David Howells 
Cc: Davidlohr Bueso 
Cc: "Matthew Wilcox (Oracle)" 
Cc: SeongJae Park 
Cc: Sven Schnelle 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

mm/swap: add swp_offset_pfn() to fetch PFN from swap entry

2022-09-27T02:46:05+00:00

We've got a bunch of special swap entries that store a PFN inside the swap
offset field.  To fetch the PFN, the user normally just calls
swp_offset(), assuming that will be the PFN.

Add a helper swp_offset_pfn() to fetch the PFN instead.  It fetches only
the maximum possible length of a PFN on the host, and a BUILD_BUG_ON() in
is_pfn_swap_entry() checks against MAX_PHYSMEM_BITS to make sure the swap
offset can always store the PFN properly.

One reason to do so is that we never tried to sanitize whether the swap
offset can really fit a PFN.  In the meantime, this patch also prepares us
for the future possibility of storing more information inside the swp
offset field, so assuming "swp_offset(entry)" to be the PFN will very soon
no longer hold.
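
The masking idea can be sketched in userspace as follows.  The constants
are assumed values chosen for illustration (real ones are per-architecture)
and the names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define MODEL_MAX_PHYSMEM_BITS 46
#define MODEL_PAGE_SHIFT       12
#define MODEL_PFN_BITS (MODEL_MAX_PHYSMEM_BITS - MODEL_PAGE_SHIFT)
#define MODEL_PFN_MASK ((UINT64_C(1) << MODEL_PFN_BITS) - 1)

/* Compile-time check that a PFN fits in the modeled 64-bit offset. */
static_assert(MODEL_PFN_BITS < 64, "PFN must fit in the offset field");

/* Return only the low offset bits that can hold a PFN. */
static uint64_t model_offset_pfn(uint64_t swp_offset)
{
	return swp_offset & MODEL_PFN_MASK;
}
```

If extra information is later stored in the upper offset bits, a caller
using the masked helper still gets only the PFN.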

Replace many of the swp_offset() callers with swp_offset_pfn() where
appropriate.  Note that many of the existing users are not candidates for
the replacement, e.g.:

  (1) When the swap entry is not a pfn swap entry at all, or,
  (2) when we want to keep the whole swp_offset but only change the swp type.

The latter can happen when fork() is triggered on a write-migration swap
entry pte: we may want to only change the migration type from write->read
but keep the rest, so it's not "fetching PFN" but "changing swap type
only".  These callers are left aside so that when there is more
information within the swp offset it will be carried over naturally in
those cases.
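
A minimal sketch of the "change type only" case, using a hypothetical
two-field model of a swap entry rather than the kernel's packed encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model types; not the kernel's swap entry encoding. */
enum model_swp_type {
	MODEL_MIGRATION_READ,
	MODEL_MIGRATION_WRITE,
};

struct model_swp_entry {
	enum model_swp_type type;
	uint64_t offset;        /* PFN plus any extra bits */
};

/* Downgrade write->read while leaving the whole offset untouched. */
static struct model_swp_entry model_make_readable(struct model_swp_entry entry)
{
	assert(entry.type == MODEL_MIGRATION_WRITE);
	entry.type = MODEL_MIGRATION_READ;
	return entry;
}
```

Because the offset is copied as a whole, any additional bits stored next to
the PFN survive the type change.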

While at it, drop hwpoison_entry_to_pfn() because that's exactly what
the new swp_offset_pfn() is about.

Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
Signed-off-by: Peter Xu 
Reviewed-by: "Huang, Ying" 
Cc: Alistair Popple 
Cc: Andi Kleen 
Cc: Andrea Arcangeli 
Cc: David Hildenbrand 
Cc: Hugh Dickins 
Cc: "Kirill A . Shutemov" 
Cc: Minchan Kim 
Cc: Nadav Amit 
Cc: Vlastimil Babka 
Cc: Dave Hansen 
Signed-off-by: Andrew Morton

mm/thp: add flag to enforce sysfs THP in hugepage_vma_check()

2022-09-12T03:25:45+00:00

MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].

hugepage_vma_check() is the authority on determining if a VMA is eligible
for THP allocation/collapse, and currently enforces the sysfs THP
settings.  Add a flag to disable these checks.  For now, only apply this
arg to anon and file, which use /sys/kernel/transparent_hugepage/enabled. 
We can expand this to shmem, which uses
/sys/kernel/transparent_hugepage/shmem_enabled, later.

Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
VM_HUGEPAGE check in "madvise" THP mode.  Prior to "mm: khugepaged: check
THP flag in hugepage_vma_check()", this check also didn't check "never"
THP mode.  As such, this restores the previous behavior of
collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
comment in code for justification why this is OK.
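
The effect of such a flag can be sketched with a heavily simplified
userspace model.  The enum, the function names and the reduction of the
sysfs setting to three modes are assumptions of this example; the real
hugepage_vma_check() performs many more checks:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the sysfs "enabled" setting. */
enum model_thp_mode {
	MODEL_THP_NEVER,
	MODEL_THP_MADVISE,
	MODEL_THP_ALWAYS,
};

/* Would the modeled sysfs setting allow THP for this VMA? */
static bool model_sysfs_allows(enum model_thp_mode mode, bool vma_madvised)
{
	assert(mode == MODEL_THP_NEVER || mode == MODEL_THP_MADVISE ||
	       mode == MODEL_THP_ALWAYS);
	if (mode == MODEL_THP_ALWAYS)
		return true;
	if (mode == MODEL_THP_MADVISE)
		return vma_madvised;
	return false;
}

/* With enforce_sysfs == false the sysfs setting is skipped entirely. */
static bool model_vma_check(enum model_thp_mode mode, bool vma_madvised,
			    bool enforce_sysfs)
{
	if (enforce_sysfs && !model_sysfs_allows(mode, vma_madvised))
		return false;
	return true;    /* all other eligibility checks are elided here */
}
```

In this model a caller passing enforce_sysfs == false is eligible even in
"never" mode, which mirrors the restored behavior described above.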

[1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/

Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.com
Signed-off-by: Zach O'Keefe 
Reviewed-by: Yang Shi 
Cc: Alex Shi 
Cc: Andrea Arcangeli 
Cc: Arnd Bergmann 
Cc: Axel Rasmussen 
Cc: Chris Kennelly 
Cc: Chris Zankel 
Cc: David Hildenbrand 
Cc: David Rientjes 
Cc: Helge Deller 
Cc: Hugh Dickins 
Cc: Ivan Kokshaysky 
Cc: James Bottomley 
Cc: Jens Axboe 
Cc: "Kirill A. Shutemov" 
Cc: Matthew Wilcox 
Cc: Matt Turner 
Cc: Max Filippov 
Cc: Miaohe Lin 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Pasha Tatashin 
Cc: Pavel Begunkov 
Cc: Peter Xu 
Cc: Rongwei Wang 
Cc: SeongJae Park 
Cc: Song Liu 
Cc: Thomas Bogendoerfer 
Cc: Vlastimil Babka 
Cc: Zi Yan 
Cc: Dan Carpenter 
Cc: "Souptick Joarder (HPE)" 
Signed-off-by: Andrew Morton

mm/smaps: don't access young/dirty bit if pte unpresent

2022-08-20T22:17:45+00:00

These bits should only be valid when the ptes are present.  Introduce
two booleans for them and set them to false when !pte_present(), for both
pte and pmd accounting.
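
The pattern can be sketched in userspace like this.  The struct layout and
names are hypothetical and only model the idea of reading the bits when the
entry is present:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page table entry and of the smaps counters. */
struct model_pte {
	bool present;
	bool young_bit;
	bool dirty_bit;
};

struct model_stats {
	unsigned long referenced;
	unsigned long dirty;
};

/* Only trust the young/dirty bits when the entry is present. */
static void model_account(struct model_stats *stats, struct model_pte pte,
			  unsigned long size)
{
	bool young = false;
	bool dirty = false;

	assert(stats);
	if (pte.present) {
		young = pte.young_bit;
		dirty = pte.dirty_bit;
	}
	if (young)
		stats->referenced += size;
	if (dirty)
		stats->dirty += size;
}
```

A non-present entry whose raw bits happen to look "young" or "dirty" no
longer inflates the counters in this model.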

The bug was found during code reading and no real-world issue has been
reported, but logically such an error can cause incorrect readings for
quite a few fields in either smaps or smaps_rollup output.

For example, it could cause an overestimate of values like Shared_Dirty,
Private_Dirty and Referenced, or an underestimate of values like
LazyFree, Shared_Clean and Private_Clean.

Link: https://lkml.kernel.org/r/20220805160003.58929-1-peterx@redhat.com
Fixes: b1d4d9e0cbd0 ("proc/smaps: carefully handle migration entries")
Fixes: c94b6923fa0a ("/proc/PID/smaps: Add PMD migration entry parsing")
Signed-off-by: Peter Xu 
Reviewed-by: Vlastimil Babka 
Reviewed-by: David Hildenbrand 
Reviewed-by: Yang Shi 
Cc: Konstantin Khlebnikov 
Cc: Huang Ying 
Signed-off-by: Andrew Morton

mm: thp: kill __transhuge_page_enabled()

2022-07-18T00:14:33+00:00

The page fault path checks THP eligibility with __transhuge_page_enabled(),
which does much the same thing as hugepage_vma_check(), so use
hugepage_vma_check() instead.

However, page fault allows the DAX and !anon_vma cases, so add a new flag,
in_pf, to hugepage_vma_check() to make page fault work correctly.

The in_pf flag is also used to skip shmem and file THP for page fault
since shmem handles THP in its own shmem_fault() and file THP allocation
on fault is not supported yet.

Also remove hugepage_vma_enabled(): since hugepage_vma_check() is now its
only caller, a separate helper function is not necessary.

Link: https://lkml.kernel.org/r/20220616174840.1202070-6-shy828301@gmail.com
Signed-off-by: Yang Shi 
Reviewed-by: Zach O'Keefe 
Cc: Kirill A. Shutemov 
Cc: Matthew Wilcox 
Cc: Miaohe Lin 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton