linux.git/mm, branch v5.13

mm/page_alloc: do bulk array bounds check after checking populated elements

2021-06-25T02:40:54+00:00

Dan Carpenter reported the following

  The patch 0f87d9d30f21: "mm/page_alloc: add an array-based interface
  to the bulk page allocator" from Apr 29, 2021, leads to the following
  static checker warning:

        mm/page_alloc.c:5338 __alloc_pages_bulk()
        warn: potentially one past the end of array 'page_array[nr_populated]'

The problem can occur if an array is passed in that is fully populated.
That potentially ends up allocating a single page and storing it past
the end of the array.  This patch returns 0 if the array is fully
populated.

Link: https://lkml.kernel.org/r/20210618125102.GU30378@techsingularity.net
Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator")
Signed-off-by: Mel Gorman 
Reported-by: Dan Carpenter 
Cc: Jesper Dangaard Brouer 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array

2021-06-25T02:40:54+00:00

In the event that somebody would call this with an already fully
populated page_array, the last loop iteration would do an access beyond
the end of page_array.

It's of course extremely unlikely that would ever be done, but this
triggers my internal static analyzer.  Also, if it really is not
supposed to be invoked this way (i.e., with no NULL entries in
page_array), the nr_populated
Acked-by: Mel Gorman 
Reviewed-by: Andrew Morton 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/hwpoison: do not lock page again when me_huge_page() successfully recovers

2021-06-25T02:40:54+00:00

Currently me_huge_page() temporary unlocks page to perform some actions
then locks it again later.  My testcase (which calls hard-offline on
some tail page in a hugetlb, then accesses the address of the hugetlb
range) showed that page allocation code detects this page lock on buddy
page and printed out "BUG: Bad page state" message.

check_new_page_bad() does not consider a page with __PG_HWPOISON as bad
page, so this flag works as kind of filter, but this filtering doesn't
work in this case because the "bad page" is not the actual hwpoisoned
page.  So stop locking page again.  Actions to be taken depend on the
page type of the error, so page unlocking should be done in ->action()
callbacks.  So let's make it assumed and change all existing callbacks
that way.

Link: https://lkml.kernel.org/r/20210609072029.74645-1-nao.horiguchi@gmail.com
Fixes: commit 78bb920344b8 ("mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error")
Signed-off-by: Naoya Horiguchi 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Tony Luck 
Cc: "Aneesh Kumar K.V" 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned

2021-06-25T02:40:54+00:00

When memory_failure() is called with MF_ACTION_REQUIRED on the page that
has already been hwpoisoned, memory_failure() could fail to send SIGBUS
to the affected process, which results in infinite loop of MCEs.

Currently memory_failure() returns 0 if it's called for already
hwpoisoned page, then the caller, kill_me_maybe(), could return without
sending SIGBUS to current process.  An action required MCE is raised
when the current process accesses to the broken memory, so no SIGBUS
means that the current process continues to run and access to the error
page again soon, so running into MCE loop.

This issue can arise for example in the following scenarios:

 - Two or more threads access to the poisoned page concurrently. If
   local MCE is enabled, MCE handler independently handles the MCE
   events. So there's a race among MCE events, and the second or latter
   threads fall into the situation in question.

 - If there was a precedent memory error event and memory_failure() for
   the event failed to unmap the error page for some reason, the
   subsequent memory access to the error page triggers the MCE loop
   situation.

To fix the issue, make memory_failure() return an error code when the
error page has already been hwpoisoned.  This allows memory error
handler to control how it sends signals to userspace.  And make sure
that any process touching a hwpoisoned page should get a SIGBUS even in
"already hwpoisoned" path of memory_failure() as is done in page fault
path.

Link: https://lkml.kernel.org/r/20210521030156.2612074-3-nao.horiguchi@gmail.com
Signed-off-by: Aili Yao 
Signed-off-by: Naoya Horiguchi 
Reviewed-by: Oscar Salvador 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Borislav Petkov 
Cc: David Hildenbrand 
Cc: Jue Wang 
Cc: Tony Luck 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory-failure: use a mutex to avoid memory_failure() races

2021-06-25T02:40:54+00:00

Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.

I wrote this patchset to materialize what I think is the current
allowable solution mentioned by the previous discussion [1].  I simply
borrowed Tony's mutex patch and Aili's return code patch, then I queued
another one to find error virtual address in the best effort manner.  I
know that this is not a perfect solution, but should work for some
typical case.

[1]: https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/

This patch (of 2):

There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag
and begins hunting for tasks that map this page.  Eventually it
invalidates those mappings and may send a SIGBUS to the affected tasks.

But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been
handled and continue executing.

Fix by wrapping most of the internal parts of memory_failure() in a
mutex.

[akpm@linux-foundation.org: make mf_mutex local to memory_failure()]

Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com
Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com
Signed-off-by: Tony Luck 
Signed-off-by: Naoya Horiguchi 
Reviewed-by: Borislav Petkov 
Reviewed-by: Oscar Salvador 
Cc: Aili Yao 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: David Hildenbrand 
Cc: Jue Wang 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, futex: fix shared futex pgoff on shmem huge page

2021-06-25T02:40:54+00:00

If more than one futex is placed on a shmem huge page, it can happen
that waking the second wakes the first instead, and leaves the second
waiting: the key's shared.pgoff is wrong.

When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when
generating futex_key"), the only shared huge pages came from hugetlbfs,
and the code added to deal with its exceptional page->index was put into
hugetlb source.  Then that was missed when 4.8 added shmem huge pages.

page_to_pgoff() is what others use for this nowadays: except that, as
currently written, it gives the right answer on hugetlbfs head, but
nonsense on hugetlbfs tails.  Fix that by calling hugetlbfs-specific
hugetlb_basepage_index() on PageHuge tails as well as on head.

Yes, it's unconventional to declare hugetlb_basepage_index() there in
pagemap.h, rather than in hugetlb.h; but I do not expect anything but
page_to_pgoff() ever to need it.

[akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]

Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: Neel Natu 
Signed-off-by: Hugh Dickins 
Reviewed-by: Matthew Wilcox (Oracle) 
Acked-by: Thomas Gleixner 
Cc: "Kirill A. Shutemov" 
Cc: Zhang Yi 
Cc: Mel Gorman 
Cc: Mike Kravetz 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Darren Hart 
Cc: Davidlohr Bueso 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmalloc: unbreak kasan vmalloc support

2021-06-25T02:40:54+00:00

In commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
__vmalloc_node_range was changed such that __get_vm_area_node was no
longer called with the requested/real size of the vmalloc allocation,
but rather with a rounded-up size.

This means that __get_vm_area_node called kasan_unpoision_vmalloc() with
a rounded up size rather than the real size.  This led to it allowing
access to too much memory and so missing vmalloc OOBs and failing the
kasan kunit tests.

Pass the real size and the desired shift into __get_vm_area_node.  This
allows it to round up the size for the underlying allocators while still
unpoisioning the correct quantity of shadow memory.

Adjust the other call-sites to pass in PAGE_SHIFT for the shift value.

Link: https://lkml.kernel.org/r/20210617081330.98629-1-dja@axtens.net
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213335
Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Daniel Axtens 
Tested-by: David Gow 
Reviewed-by: Nicholas Piggin 
Reviewed-by: Uladzislau Rezki (Sony) 
Tested-by: Andrey Konovalov 
Acked-by: Andrey Konovalov 
Cc: Dmitry Vyukov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmalloc: add vmalloc_no_huge

2021-06-25T02:40:53+00:00

Patch series "mm: add vmalloc_no_huge and use it", v4.

Add vmalloc_no_huge() and export it, so modules can allocate memory with
small pages.

Use the newly added vmalloc_no_huge() in KVM on s390 to get around a
hardware limitation.

This patch (of 2):

Commit 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") added
support for hugepage vmalloc mappings, it also added the flag
VM_NO_HUGE_VMAP for __vmalloc_node_range to request the allocation to be
performed with 0-order non-huge pages.

This flag is not accessible when calling vmalloc, the only option is to
call directly __vmalloc_node_range, which is not exported.

This means that a module can't vmalloc memory with small pages.

Case in point: KVM on s390x needs to vmalloc a large area, and it needs
to be mapped with non-huge pages, because of a hardware limitation.

This patch adds the function vmalloc_no_huge, which works like vmalloc,
but it is guaranteed to always back the mapping using small pages.  This
new function is exported, therefore it is usable by modules.

[akpm@linux-foundation.org: whitespace fixes, per Christoph]

Link: https://lkml.kernel.org/r/20210614132357.10202-1-imbrenda@linux.ibm.com
Link: https://lkml.kernel.org/r/20210614132357.10202-2-imbrenda@linux.ibm.com
Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Claudio Imbrenda 
Reviewed-by: Uladzislau Rezki (Sony) 
Acked-by: Nicholas Piggin 
Reviewed-by: David Hildenbrand 
Acked-by: David Rientjes 
Cc: Uladzislau Rezki (Sony) 
Cc: Catalin Marinas 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Cornelia Huck 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()

2021-06-25T02:40:53+00:00

Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
ptlock in the PVMW_SYNC case? That too might have been responsible for
BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
I've never seen any.

Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com
Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins 
Acked-by: Kirill A. Shutemov 
Tested-by: Wang Yugui 
Cc: Alistair Popple 
Cc: Matthew Wilcox 
Cc: Peter Xu 
Cc: Ralph Campbell 
Cc: Will Deacon 
Cc: Yang Shi 
Cc: Zi Yan 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes

2021-06-25T02:40:53+00:00

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

Crash dumps showed two tail pages of a shmem huge page remained mapped
by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
of a long unmapped range; and no page table had yet been allocated for
the head of the huge page to be mapped into.

Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span the boundary, with ptes present in
the next.

Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before.  Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times
less often: because of the way redundant levels are folded together, but
folded differently in different configurations, it was just too
difficult to use them correctly; and step_forward() is simpler anyway.

Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins 
Acked-by: Kirill A. Shutemov 
Cc: Alistair Popple 
Cc: Matthew Wilcox 
Cc: Peter Xu 
Cc: Ralph Campbell 
Cc: Wang Yugui 
Cc: Will Deacon 
Cc: Yang Shi 
Cc: Zi Yan 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds