linux-stable.git/mm/memory.c, branch v6.1.2

hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

2022-11-30T22:49:40+00:00

madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range.  For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final.  However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing.  Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
not set in new pages added to the page table.  This resulted in pages that
appeared anonymous in a VM_SHARED vma and triggered the BUG.

Address the issue by adding a new zap flag, ZAP_FLAG_UNMAP, to indicate an
unmap call from unmap_vmas().  This is used to indicate the 'final'
unmapping of a hugetlb vma.  When called via MADV_DONTNEED, this flag is
not set and the vma_lock is not deleted.
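
A minimal userspace sketch of the flag check described above.  Everything
prefixed with toy_/TOY_ is a hypothetical stand-in invented for
illustration (including the flag value); it is not the kernel code, only
a model of "drop the lock on a final unmap, keep it otherwise":

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for ZAP_FLAG_UNMAP; the value is illustrative. */
#define TOY_ZAP_FLAG_UNMAP 0x1u

/* Toy vma: only models the lock hanging off a hugetlb vma. */
struct toy_vma {
    void *vma_lock;
};

/* Models the last step after a hugetlb range has been unmapped. */
static void toy_unmap_hugepage_range_final(struct toy_vma *vma,
                                           unsigned int zap_flags)
{
    /* Only a real unmap (the vma is going away) drops the lock. */
    if (zap_flags & TOY_ZAP_FLAG_UNMAP) {
        free(vma->vma_lock);
        vma->vma_lock = NULL;
    }
}

static bool toy_vma_has_lock(const struct toy_vma *vma)
{
    return vma->vma_lock != NULL;
}
```

Calling the helper with zap_flags == 0 models the MADV_DONTNEED path: the
vma survives and so does its lock.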

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz 
Reported-by: Wei Chen 
Cc: Axel Rasmussen 
Cc: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Mina Almasry 
Cc: Nadav Amit 
Cc: Naoya Horiguchi 
Cc: Peter Xu 
Cc: Rik van Riel 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

madvise: use zap_page_range_single for madvise dontneed

2022-11-30T22:49:40+00:00

This series addresses the issue first reported in [1], and fully described
in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
for stable backports.

While exploring solutions to this issue, related problems with mmu
notification calls were discovered.  These are addressed in the patch
"hugetlb: remove duplicate mmu notifications".  Since there are no user
visible effects, this third patch is not tagged for stable backports.

Previous discussions suggested further cleanup by removing the
routine zap_page_range.  This is possible because zap_page_range_single
is now exported, and all callers of zap_page_range pass ranges entirely
within a single vma.  This work will be done in a later patch so as not
to distract from this bug fix.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/


This patch (of 2):

Expose the routine zap_page_range_single to zap a range within a single
vma.  The madvise routine madvise_dontneed_single_vma can use this routine
as it explicitly operates on a single vma.  Also, update the mmu
notification range in zap_page_range_single to take hugetlb pmd sharing
into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
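
To illustrate why pmd sharing affects the notification range: when a pmd
page may be shared, unsharing it affects a whole PUD-sized region, so the
range reported to mmu notifiers has to be widened to PUD boundaries.  The
sketch below is a simplified userspace model under that assumption.  The
toy_ names and the 1 GiB TOY_PUD_SIZE (the x86-64 value) are chosen for
illustration and are not taken from this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed PUD size: 1 GiB, as on x86-64 with 4 KiB base pages. */
#define TOY_PUD_SIZE (1ULL << 30)

static uint64_t toy_align_down(uint64_t x, uint64_t a)
{
    return x & ~(a - 1);
}

static uint64_t toy_align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Toy vma: address range plus a "may share" bit (models VM_MAYSHARE). */
struct toy_vma {
    uint64_t vm_start, vm_end;
    bool may_share;
};

/* Widen [*start, *end) to PUD boundaries if pmd sharing is possible. */
static void toy_widen_for_pmd_sharing(const struct toy_vma *vma,
                                      uint64_t *start, uint64_t *end)
{
    /* PUD-aligned part of the vma; sharing can only happen inside it. */
    uint64_t v_start = toy_align_up(vma->vm_start, TOY_PUD_SIZE);
    uint64_t v_end = toy_align_down(vma->vm_end, TOY_PUD_SIZE);

    if (!vma->may_share || v_end <= v_start ||
        *end <= v_start || *start >= v_end)
        return;

    if (*start > v_start)
        *start = toy_align_down(*start, TOY_PUD_SIZE);
    if (*end < v_end)
        *end = toy_align_up(*end, TOY_PUD_SIZE);
}
```

For a private mapping (may_share == false) the range is left untouched.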

Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz 
Reported-by: Wei Chen 
Cc: Axel Rasmussen 
Cc: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Mina Almasry 
Cc: Nadav Amit 
Cc: Naoya Horiguchi 
Cc: Peter Xu 
Cc: Rik van Riel 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

mm/memory: return vm_fault_t result from migrate_to_ram() callback

2022-11-23T02:50:42+00:00

The migrate_to_ram() callback should always succeed, but in rare cases it
can fail, usually returning VM_FAULT_SIGBUS.  Commit 16ce101db85d
("mm/memory.c: fix race when faulting a device private page") incorrectly
stopped passing the return code up the stack.  Fix this by setting the ret
variable, restoring the previous behaviour on migrate_to_ram() failure.
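
A small userspace sketch of the pattern being restored: the fault handler
stores the callback result in ret and returns it, so a failure reaches
the caller.  The toy_ types, the callback signature, and the
TOY_VM_FAULT_SIGBUS value are illustrative assumptions, not the kernel
definitions:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int toy_vm_fault_t;

/* Illustrative value standing in for VM_FAULT_SIGBUS. */
#define TOY_VM_FAULT_SIGBUS 0x2u

/* Hypothetical callback type modelling migrate_to_ram(). */
typedef toy_vm_fault_t (*toy_migrate_to_ram_fn)(void *fault_ctx);

/* Models the device private branch of the fault path. */
static toy_vm_fault_t toy_handle_device_private(toy_migrate_to_ram_fn cb,
                                                void *fault_ctx)
{
    toy_vm_fault_t ret = 0;

    /* The fix: keep the callback result instead of discarding it. */
    ret = cb(fault_ctx);

    return ret;
}

/* Two toy callbacks: the common success case and the rare failure. */
static toy_vm_fault_t toy_cb_ok(void *fault_ctx)
{
    (void)fault_ctx;
    return 0;
}

static toy_vm_fault_t toy_cb_fail(void *fault_ctx)
{
    (void)fault_ctx;
    return TOY_VM_FAULT_SIGBUS;
}
```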

Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page")
Signed-off-by: Alistair Popple 
Acked-by: David Hildenbrand 
Reviewed-by: Felix Kuehling 
Cc: Ralph Campbell 
Cc: John Hubbard 
Cc: Alex Sierra 
Cc: Ben Skeggs 
Cc: Lyude Paul 
Cc: Jason Gunthorpe 
Cc: Michael Ellerman 
Signed-off-by: Andrew Morton

mm: use update_mmu_tlb() on the second thread

2022-10-13T01:51:50+00:00

As the message in commit 7df676974359 ("mm/memory.c: Update local TLB if
PTE entry exists") says, the local TLB should be updated only on the second
thread.  So in do_anonymous_page(), use update_mmu_tlb() instead of
update_mmu_cache() on the second thread.

As David pointed out, this is a performance improvement, not a
correctness fix.

Link: https://lkml.kernel.org/r/20220929112318.32393-2-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng 
Reviewed-by: Muchun Song 
Acked-by: David Hildenbrand 
Cc: Bibo Mao 
Cc: Chris Zankel 
Cc: Huacai Chen 
Cc: Max Filippov 
Signed-off-by: Andrew Morton

mm/memory.c: fix race when faulting a device private page

2022-10-13T01:51:49+00:00

Patch series "Fix several device private page reference counting issues",
v2

This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages.  These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems.  However
without these fixes it is possible to crash the kernel from userspace. 
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting. 
In modules such as Nouveau it is also possible to trigger some of these
issues by explicitly closing the device file-descriptor prior to the task
exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code. 
Unfortunately I lack hardware to test either of those so any help there
would be appreciated.  The changes mimic what is done for both Nouveau
and hmm-tests, though, so I doubt they will cause problems.


This patch (of 8):

When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called.  However no
reference is taken on the faulting page.  Therefore a concurrent migration
of the device private page can free the page and possibly the underlying
pgmap.  This results in a race which can crash the kernel due to the
migrate_to_ram() function pointer becoming invalid.  It also means drivers
can't reliably read the zone_device_data field because the page may have
been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to
ensure it has not been freed.  Unfortunately the elevated reference count
will cause the migration required to handle the fault to fail.  To avoid
this failure pass the faulting page into the migrate_vma functions so that
if an elevated reference count is found it can be checked to see if it's
expected or not.
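
The "expected or not" check can be sketched as follows.  This is a
userspace model under the assumption that migration normally tolerates
one extra reference (its own) beyond the mappings, and one more when the
page is the faulting page.  struct toy_page and the function name are
hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: only the two counters the check looks at. */
struct toy_page {
    int refcount;
    int mapcount;
};

/*
 * Returns true if the page has no unexpected references and may be
 * migrated.  fault_page may be NULL when no fault is being handled.
 */
static bool toy_migrate_check_page(const struct toy_page *page,
                                   const struct toy_page *fault_page)
{
    /* One reference for the migration code, one more for the fault. */
    int extra = 1 + (page == fault_page);

    return (page->refcount - extra) <= page->mapcount;
}
```

With refcount == 3 and mapcount == 1, the page passes the check only when
it is passed in as the faulting page; otherwise the third reference is
treated as unexpected and migration of that page is refused.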

[mpe@ellerman.id.au: fix build]
  Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Ralph Campbell 
Cc: Michael Ellerman 
Cc: Lyude Paul 
Cc: Alex Deucher 
Cc: Alex Sierra 
Cc: Ben Skeggs 
Cc: Christian König 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: "Huang, Ying" 
Cc: Matthew Wilcox 
Cc: Yang Shi 
Cc: Zi Yan 
Signed-off-by: Andrew Morton

mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in

2022-10-12T22:56:46+00:00

When PTE_MARKER_UFFD_WP is not configured, it's still possible to reach the
pte marker code and trigger a warning.  Add a few CONFIG_PTE_MARKER_UFFD_WP
ifdefs to make sure the code won't be reached when not compiled in.

Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu 
Cc: Axel Rasmussen 
Cc: Brian Geffon 
Cc: Edward Liaw 
Cc: Liu Shixin 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton

hugetlb: fix vma lock handling during split vma and range unmapping

2022-10-07T21:28:40+00:00

Patch series "hugetlb: fixes for new vma lock series".

In review of the series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", Miaohe Lin pointed out two key issues:

1) There is a race in the routine hugetlb_unmap_file_folio when locks
   are dropped and reacquired in the correct order [1].

2) With the switch to using vma lock for fault/truncate synchronization,
   we need to make sure lock exists for all VM_MAYSHARE vmas, not just
   vmas capable of pmd sharing.

These two issues are addressed here.  In addition, having a vma lock
present in all VM_MAYSHARE vmas uncovered some issues around vma
splitting.  Those are also addressed.

[1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/


This patch (of 3):

The hugetlb vma lock hangs off the vm_private_data field and is specific
to the vma.  When vm_area_dup() is called as part of vma splitting, the
vma lock pointer is copied to the new vma.  This will result in issues
such as double freeing of the structure.  Update the hugetlb open vm_ops
to allocate a new vma lock for the new vma.
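
A userspace sketch of the aliasing problem and the fix.  The toy_ names
are hypothetical; the point is only that a struct copy duplicates the
lock pointer, and the open hook must replace it with a fresh allocation
so each vma frees its own lock exactly once:

```c
#include <assert.h>
#include <stdlib.h>

struct toy_vma_lock {
    int placeholder;    /* the real lock contents do not matter here */
};

/* Toy vma: models only the private-data pointer holding the lock. */
struct toy_vma {
    struct toy_vma_lock *lock;
};

/* Models vm_area_dup(): a plain copy, so the lock pointer is shared. */
static struct toy_vma toy_vm_area_dup(const struct toy_vma *orig)
{
    return *orig;
}

/* Models the open hook: give the new vma a lock of its own. */
static int toy_hugetlb_vm_op_open(struct toy_vma *vma)
{
    /* Forget the inherited pointer; it still belongs to the old vma. */
    vma->lock = calloc(1, sizeof(*vma->lock));
    return vma->lock ? 0 : -1;
}
```

Without the open step, freeing the lock through both vmas would free the
same allocation twice.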

The routine __unmap_hugepage_range_final unconditionally unset VM_MAYSHARE
to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
only VM_MAYSHARE was set we would miss the free.  With the introduction of
the vma lock, a vma can not participate in pmd sharing if vm_private_data
is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
free the vma lock to prevent sharing.  Also, update the sharing code to
make sure vma lock is indeed a condition for pmd sharing. 
hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.

Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
Fixes: "hugetlb: add vma based lock for pmd sharing"
Signed-off-by: Mike Kravetz 
Cc: Andrea Arcangeli 
Cc: "Aneesh Kumar K.V" 
Cc: Axel Rasmussen 
Cc: David Hildenbrand 
Cc: Davidlohr Bueso 
Cc: James Houghton 
Cc: "Kirill A. Shutemov" 
Cc: Miaohe Lin 
Cc: Michal Hocko 
Cc: Mina Almasry 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Pasha Tatashin 
Cc: Peter Xu 
Cc: Prakash Sangappa 
Cc: Sven Schnelle 
Signed-off-by: Andrew Morton

mm: kmsan: maintain KMSAN metadata for page operations

2022-10-03T21:03:20+00:00

Insert KMSAN hooks that make the necessary bookkeeping changes:
 - poison page shadow and origins in alloc_pages()/free_page();
 - clear page shadow and origins in clear_page(), copy_user_highpage();
 - copy page metadata in copy_highpage(), wp_page_copy();
 - handle vmap()/vunmap()/iounmap().
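
The bookkeeping in the first three items can be pictured with a toy
shadow model.  This is a userspace sketch, not KMSAN itself: it assumes
one shadow byte per data byte (non-zero meaning "uninitialized"), leaves
out origins entirely, and all toy_ names are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Toy page with a parallel shadow; non-zero shadow = uninitialized. */
struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    unsigned char shadow[TOY_PAGE_SIZE];
};

/* Allocation hook: freshly allocated memory is poisoned. */
static void toy_alloc_page(struct toy_page *p)
{
    memset(p->shadow, 0xff, TOY_PAGE_SIZE);
}

/* clear_page hook: zeroed memory counts as initialized. */
static void toy_clear_page(struct toy_page *p)
{
    memset(p->data, 0, TOY_PAGE_SIZE);
    memset(p->shadow, 0, TOY_PAGE_SIZE);
}

/* copy hook: the metadata travels with the data. */
static void toy_copy_page(struct toy_page *dst, const struct toy_page *src)
{
    memcpy(dst->data, src->data, TOY_PAGE_SIZE);
    memcpy(dst->shadow, src->shadow, TOY_PAGE_SIZE);
}

static bool toy_is_initialized(const struct toy_page *p, size_t off)
{
    return p->shadow[off] == 0;
}
```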

Link: https://lkml.kernel.org/r/20220915150417.722975-15-glider@google.com
Signed-off-by: Alexander Potapenko 
Cc: Alexander Viro 
Cc: Alexei Starovoitov 
Cc: Andrey Konovalov 
Cc: Andy Lutomirski 
Cc: Arnd Bergmann 
Cc: Borislav Petkov 
Cc: Christoph Hellwig 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Dmitry Vyukov 
Cc: Eric Biggers 
Cc: Eric Dumazet 
Cc: Greg Kroah-Hartman 
Cc: Herbert Xu 
Cc: Ilya Leoshkevich 
Cc: Ingo Molnar 
Cc: Jens Axboe 
Cc: Joonsoo Kim 
Cc: Kees Cook 
Cc: Marco Elver 
Cc: Mark Rutland 
Cc: Matthew Wilcox 
Cc: Michael S. Tsirkin 
Cc: Pekka Enberg 
Cc: Peter Zijlstra 
Cc: Petr Mladek 
Cc: Stephen Rothwell 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Cc: Vasily Gorbik 
Cc: Vegard Nossum 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

hugetlb: use new vma_lock for pmd sharing synchronization

2022-10-03T21:03:17+00:00

The new hugetlb vma lock is used to address this race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before
  doing a page table lock and allocation (huge_pte_alloc).  The lock is
  held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
  called.

Lock ordering issues come into play when unmapping a page from all
vmas mapping the page.  The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare.  This is done today in:
- try_to_migrate_one and try_to_unmap_ for page migration and memory
  error handling.  In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful.  Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch.  This routine
  also tries to acquire the vma lock.  If it fails, it skips the
  unmapping.  However, we can not have file truncation or hole punch
  fail because of contention.  After hugetlb_vmdelete_list, truncation
  and hole punch call remove_inode_hugepages.  remove_inode_hugepages
  checks for mapped pages and calls hugetlb_unmap_file_page to unmap them.
  hugetlb_unmap_file_page is designed to drop locks and reacquire in the
  correct order to guarantee unmap success.
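
The 'try' pattern used by these routines can be sketched with a toy
reader/writer lock.  This is a single-threaded userspace model with
hypothetical toy_ names; it only shows the control flow (readers block
the write trylock, and the unmap is skipped rather than waiting), not
real synchronization:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy rw lock state: enough to model trylock success and failure. */
struct toy_rwsem {
    int readers;
    bool writer;
};

static bool toy_read_trylock(struct toy_rwsem *s)
{
    if (s->writer)
        return false;
    s->readers++;
    return true;
}

static void toy_read_unlock(struct toy_rwsem *s)
{
    s->readers--;
}

static bool toy_write_trylock(struct toy_rwsem *s)
{
    if (s->writer || s->readers > 0)
        return false;
    s->writer = true;
    return true;
}

static void toy_write_unlock(struct toy_rwsem *s)
{
    s->writer = false;
}

/* Models one 'try' unmap: returns true if the vma was unmapped. */
static bool toy_try_unmap_vma(struct toy_rwsem *vma_lock, int *unshare_calls)
{
    if (!toy_write_trylock(vma_lock))
        return false;           /* contention: skip, caller copes */

    (*unshare_calls)++;         /* stands in for huge_pmd_unshare() */
    toy_write_unlock(vma_lock);
    return true;
}
```

A fault holding the lock in read mode makes toy_try_unmap_vma() fail; once
the reader drops the lock, a retry succeeds.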

Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz 
Cc: Andrea Arcangeli 
Cc: "Aneesh Kumar K.V" 
Cc: Axel Rasmussen 
Cc: David Hildenbrand 
Cc: Davidlohr Bueso 
Cc: James Houghton 
Cc: "Kirill A. Shutemov" 
Cc: Miaohe Lin 
Cc: Michal Hocko 
Cc: Mina Almasry 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Pasha Tatashin 
Cc: Peter Xu 
Cc: Prakash Sangappa 
Cc: Sven Schnelle 
Signed-off-by: Andrew Morton

mm: use nth_page instead of mem_map_offset mem_map_next

2022-10-03T21:03:08+00:00

To handle the discontiguous case, mem_map_next() has a parameter named
`offset`.  A caller may well be confused about why "get next entry" needs
a parameter named "offset".  The other drawback of mem_map_next() is that
callers must keep the parameters "iter" and "offset" in step with each
other, otherwise the iteration may skip or repeat entries.  So use
nth_page() instead of mem_map_next().

And replace mem_map_offset with nth_page() per Matthew's comments.
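
A toy userspace model of why indexing through the pfn is safer than
walking struct page pointers when the memory map is discontiguous.  The
two-section layout, the stored pfn field, and all toy_ names are
assumptions made for illustration; the real helpers are implemented
differently:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGES_PER_SECTION 4UL

/* Toy page descriptor; storing the pfn keeps the model simple. */
struct toy_page {
    unsigned long pfn;
};

/* Two separate arrays: descriptors are not contiguous across them. */
static struct toy_page toy_section0[TOY_PAGES_PER_SECTION];
static struct toy_page toy_section1[TOY_PAGES_PER_SECTION];
static struct toy_page *toy_sections[] = { toy_section0, toy_section1 };

static struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
    return &toy_sections[pfn / TOY_PAGES_PER_SECTION]
                        [pfn % TOY_PAGES_PER_SECTION];
}

static unsigned long toy_page_to_pfn(const struct toy_page *page)
{
    return page->pfn;
}

/* Models nth_page(): go through the pfn, never through page + n. */
static struct toy_page *toy_nth_page(struct toy_page *page, unsigned long n)
{
    return toy_pfn_to_page(toy_page_to_pfn(page) + n);
}

/* Fill in the pfn of every descriptor in both sections. */
static void toy_init(void)
{
    unsigned long pfn;

    for (pfn = 0; pfn < 2 * TOY_PAGES_PER_SECTION; pfn++)
        toy_pfn_to_page(pfn)->pfn = pfn;
}
```

Starting from pfn 2, toy_nth_page(base, 2) lands on the first descriptor
of the second array, which plain pointer arithmetic on base would miss.
A loop then needs only a single index, with no separate offset to keep
in sync.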

Link: https://lkml.kernel.org/r/1662708669-9395-1-git-send-email-lic121@chinatelecom.cn
Signed-off-by: Cheng Li 
Fixes: 69d177c2fc70 ("hugetlbfs: handle pages higher order than MAX_ORDER")
Reviewed-by: Matthew Wilcox (Oracle) 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton