linux-stable.git/mm/page_alloc.c, branch v5.18.3

mm/page_alloc: always attempt to allocate at least one page during bulk allocation

2022-06-09T08:30:52+00:00

commit c572e4888ad1be123c1516ec577ad30a700bbec4 upstream.

Peter Pavlisko reported the following problem on kernel bugzilla 216007.

	When I try to extract an uncompressed tar archive (2.6 million
	files, 760.3 GiB in size) on newly created (empty) XFS file system,
	after first low tens of gigabytes extracted the process hangs in
	iowait indefinitely. One CPU core is 100% occupied with iowait,
	the other CPU core is idle (on 2-core Intel Celeron G1610T).

It was bisected to c9fa563072e1 ("xfs: use alloc_pages_bulk_array() for
buffers") but XFS is only the messenger.  The problem is that nothing
wakes kswapd to reclaim pages at a time when the PCP lists cannot be
refilled until some reclaim happens.  The bulk allocator checks that there
are some pages in the array and the original intent was that a bulk
allocator did not necessarily need all the requested pages and it was best
to return as quickly as possible.

This was fine for the first user of the API but both NFS and XFS require
the requested number of pages be available before making progress.  Both
could be adjusted to call the page allocator directly if a bulk allocation
fails but it puts a burden on users of the API.  Adjust the semantics to
attempt at least one allocation via __alloc_pages() before returning so
kswapd is woken if necessary.
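
The adjusted semantics can be illustrated with a small userspace model.
This is only a sketch, not the kernel code: struct pcp_sim,
slow_alloc_one() and bulk_alloc_sim() are hypothetical names, pages are
modelled as 0/1 array slots, and the slow path is assumed to always
succeed after waking kswapd.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU (PCP) free list. */
struct pcp_sim {
	int free_pages;      /* pages available without any reclaim */
	int kswapd_wakeups;  /* number of times the slow path was entered */
};

/*
 * Hypothetical slow path. It models __alloc_pages() waking kswapd and is
 * assumed to always return one page in this sketch.
 */
static int slow_alloc_one(struct pcp_sim *pcp)
{
	pcp->kswapd_wakeups++;
	return 1;
}

/*
 * Fill the empty slots of page_array (0 = empty, 1 = populated) from the
 * simulated PCP list and return the number of populated slots.
 */
static int bulk_alloc_sim(struct pcp_sim *pcp, int nr_pages, int *page_array)
{
	int nr_populated = 0;
	int nr_new = 0;   /* pages allocated by this call */
	int i;

	assert(pcp != NULL && page_array != NULL && nr_pages >= 0);

	for (i = 0; i < nr_pages; i++) {
		if (page_array[i]) {
			nr_populated++;   /* slot filled by an earlier call */
			continue;
		}
		if (pcp->free_pages == 0)
			continue;         /* nothing left on the fast path */
		pcp->free_pages--;
		page_array[i] = 1;
		nr_populated++;
		nr_new++;
	}

	/*
	 * The fast path produced nothing but slots are still empty: attempt
	 * one allocation through the slow path so that reclaim is kicked.
	 */
	if (nr_new == 0 && nr_populated < nr_pages) {
		for (i = 0; i < nr_pages; i++) {
			if (!page_array[i]) {
				page_array[i] = slow_alloc_one(pcp);
				nr_populated++;
				break;
			}
		}
	}

	return nr_populated;
}
```

In this model a caller that retries until the array is full always makes
progress, because every call with an empty PCP list either returns a full
array or goes through the slow path once.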

It was reported via bugzilla that the patch addressed the problem and that
the tar extraction completed successfully.  This may also address bug
215975 but has yet to be confirmed.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216007
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215975
Link: https://lkml.kernel.org/r/20220526091210.GC3441@techsingularity.net
Fixes: 387ba26fb1cb ("mm/page_alloc: add a bulk page allocator")
Signed-off-by: Mel Gorman 
Cc: "Darrick J. Wong" 
Cc: Dave Chinner 
Cc: Jan Kara 
Cc: Vlastimil Babka 
Cc: Jesper Dangaard Brouer 
Cc: Chuck Lever 
Cc: 	[5.13+]
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

page_alloc: use vmalloc_huge for large system hash

2022-04-24T17:00:54+00:00

Use vmalloc_huge() in alloc_large_system_hash() so that a large system
hash (>= PMD_SIZE) can benefit from huge pages.

Note that vmalloc_huge only allocates huge pages for systems with
HAVE_ARCH_HUGE_VMALLOC.

Signed-off-by: Song Liu 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Rik van Riel 
Signed-off-by: Linus Torvalds

mm, page_alloc: fix build_zonerefs_node()

2022-04-15T21:49:55+00:00

Since commit 6aa303defb74 ("mm, vmscan: only allocate and reclaim from
zones with pages managed by the buddy allocator") only zones with free
memory are included in a built zonelist.  This is problematic when e.g.
all memory of a zone has been ballooned out when zonelists are being
rebuilt.

The decision whether to rebuild the zonelists when onlining new memory
is done based on populated_zone() returning 0 for the zone the memory
will be added to.  The new zone is added to the zonelists only if it
has free memory pages (managed_zone() returns a non-zero value) after
the memory has been onlined.  This implies that onlining memory will
always free the added pages to the allocator immediately, but this is
not true in all cases: when e.g. running as a Xen guest the onlined new
memory will be added only to the ballooned memory list, it will be freed
only when the guest is being ballooned up afterwards.

Another problem with using managed_zone() for the decision whether a
zone is being added to the zonelists is that a zone with all memory
used will in fact be removed from all zonelists in case the zonelists
happen to be rebuilt.

Use populated_zone() when building a zonelist as it has been done before
that commit.
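
The difference between the two predicates can be shown with a userspace
model.  This is a sketch under assumptions: struct zone_sim and the
*_sim() helpers are hypothetical, and a zone is reduced to two counters,
one for pages present in the zone and one for pages managed by the buddy
allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a zone. */
struct zone_sim {
	unsigned long present_pages;  /* pages physically present */
	unsigned long managed_pages;  /* pages handed to the buddy allocator */
};

/* Models populated_zone(): the zone has present pages. */
static bool populated_zone_sim(const struct zone_sim *z)
{
	return z->present_pages != 0;
}

/* Models managed_zone(): the zone has buddy-managed pages. */
static bool managed_zone_sim(const struct zone_sim *z)
{
	return z->managed_pages != 0;
}

/*
 * Collect the indices of the zones that would end up in a zonelist.
 * use_populated selects which predicate decides. Returns the number of
 * indices written to out, which must have room for nr_zones entries.
 */
static int build_zonerefs_sim(const struct zone_sim *zones, int nr_zones,
			      bool use_populated, int *out)
{
	int nr = 0;
	int i;

	assert(zones != NULL && out != NULL && nr_zones >= 0);

	for (i = 0; i < nr_zones; i++) {
		bool include = use_populated ? populated_zone_sim(&zones[i])
					     : managed_zone_sim(&zones[i]);
		if (include)
			out[nr++] = i;
	}
	return nr;
}
```

A zone whose memory is entirely ballooned out (present pages, no managed
pages) is kept by the populated check and dropped by the managed check,
which is the case described above.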

There was a report that QubesOS (based on Xen) is hitting this problem.
Xen has switched to use the zone device functionality in kernel 5.9 and
QubesOS wants to use memory hotplugging for guests in order to be able
to start a guest with minimal memory and expand it as needed.  This was
the report leading to the patch.

Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Juergen Gross 
Reported-by: Marek Marczykowski-Górecki 
Acked-by: Michal Hocko 
Acked-by: David Hildenbrand 
Cc: Marek Marczykowski-Górecki 
Reviewed-by: Wei Yang 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Revert "mm/page_alloc: mark pagesets as __maybe_unused"

2022-04-05T07:59:39+00:00

The local_lock() is now using a proper static inline function which is
enough for llvm to accept that the variable is used.

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/20220328145810.86783-4-bigeasy@linutronix.de

mm/munlock: protect the per-CPU pagevec by a local_lock_t

2022-04-01T18:46:09+00:00

The access to mlock_pvec is protected by disabling preemption via
get_cpu_var() or implicitly by having preemption disabled by the caller
(in mlock_page_drain() case).  This breaks on PREEMPT_RT since
folio_lruvec_lock_irq() acquires a sleeping lock in this section.

Create struct mlock_pvec which consists of the local_lock_t and the
pagevec.  Acquire the local_lock() before accessing the per-CPU pagevec.
Replace mlock_page_drain() with a _local() version which is invoked on
the local CPU and acquires the local_lock_t and a _remote() version
which uses the pagevec from a remote CPU which is offline.

Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior 
Acked-by: Hugh Dickins 
Cc: Vlastimil Babka 
Cc: Matthew Wilcox 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page_alloc: validate buddy before check its migratetype.

2022-03-30T22:45:43+00:00

Whenever a buddy page is found, page_is_buddy() should be called to
check its validity.  Add the missing check during pageblock merge check.
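
The ordering of the two checks can be sketched in userspace.  This is a
simplified model, not the kernel implementation: struct page_sim and the
*_sim() helpers are hypothetical, the validity check only looks at a
"free in the buddy allocator" bit and the order, and the merge rule is
reduced to "same migratetype".  The buddy PFN is obtained by flipping the
order bit of the PFN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced page descriptor. */
struct page_sim {
	bool is_buddy;       /* page is free and owned by the buddy allocator */
	unsigned int order;  /* order of the free block, valid if is_buddy */
	int migratetype;     /* only meaningful for a valid buddy here */
};

/* The buddy of a block is found by flipping bit 'order' of its PFN. */
static unsigned long find_buddy_pfn_sim(unsigned long pfn, unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	return pfn ^ (1UL << order);
}

/* Simplified validity check: free in the buddy allocator, same order. */
static bool page_is_buddy_sim(const struct page_sim *buddy, unsigned int order)
{
	return buddy->is_buddy && buddy->order == order;
}

/*
 * Decide whether the block at pfn may merge with its buddy. The buddy is
 * validated first; its migratetype is only read once it is known to be a
 * real buddy page.
 */
static bool can_merge_sim(const struct page_sim *pages, unsigned long nr_pages,
			  unsigned long pfn, unsigned int order, int migratetype)
{
	unsigned long buddy_pfn = find_buddy_pfn_sim(pfn, order);
	const struct page_sim *buddy;

	assert(pages != NULL);

	if (buddy_pfn >= nr_pages)
		return false;            /* buddy lies outside the model */

	buddy = &pages[buddy_pfn];
	if (!page_is_buddy_sim(buddy, order))
		return false;            /* not a valid buddy, stop here */

	return buddy->migratetype == migratetype;
}
```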

Fixes: 1dd214b8f21c ("mm: page_alloc: avoid merging non-fallbackable pageblocks with others")
Link: https://lore.kernel.org/all/20220330154208.71aca532@gandalf.local.home/
Reported-and-tested-by: Steven Rostedt 
Signed-off-by: Zi Yan 
Signed-off-by: Linus Torvalds

kasan, page_alloc: allow skipping memory init for HW_TAGS

2022-03-25T02:06:47+00:00

Add a new GFP flag __GFP_SKIP_ZERO that allows skipping memory
initialization.  The flag is only effective with HW_TAGS KASAN.

This flag will be used by vmalloc code for page_alloc allocations backing
vmalloc() mappings in a following patch.  The reason to skip memory
initialization for these pages in page_alloc is because vmalloc code will
be initializing them instead.

With the current implementation, when __GFP_SKIP_ZERO is provided,
__GFP_ZEROTAGS is ignored.  This doesn't matter, as these two flags are
never provided at the same time.  However, if this is changed in the
future, this particular implementation detail can be changed as well.
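
The precedence described above can be sketched as a small decision
function.  This is an illustration under assumptions: the SIM_GFP_* bit
values, struct init_decision and decide_init_sim() are hypothetical, and
"memory init requested" is reduced to a single zeroing flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real __GFP_* values differ. */
#define SIM_GFP_ZERO       0x1u  /* stands in for a zeroing request */
#define SIM_GFP_ZEROTAGS   0x2u  /* stands in for __GFP_ZEROTAGS */
#define SIM_GFP_SKIP_ZERO  0x4u  /* stands in for __GFP_SKIP_ZERO */
#define SIM_GFP_MASK       (SIM_GFP_ZERO | SIM_GFP_ZEROTAGS | SIM_GFP_SKIP_ZERO)

struct init_decision {
	bool init_memory;  /* zero the page contents */
	bool init_tags;    /* clear the memory tags while zeroing */
};

static struct init_decision decide_init_sim(unsigned int flags,
					    bool hw_tags_enabled)
{
	struct init_decision d;
	bool skip;

	assert((flags & ~SIM_GFP_MASK) == 0);

	/* The skip flag is only honoured with HW_TAGS KASAN. */
	skip = hw_tags_enabled && (flags & SIM_GFP_SKIP_ZERO);

	d.init_memory = (flags & SIM_GFP_ZERO) && !skip;
	/* Tag zeroing depends on memory init, so skipping init drops it. */
	d.init_tags = d.init_memory && (flags & SIM_GFP_ZEROTAGS);
	return d;
}
```

Because tag clearing is derived from the memory init decision in this
model, passing the skip flag silently disables it as well.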

Link: https://lkml.kernel.org/r/0d53efeff345de7d708e0baa0d8829167772521e.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov 
Acked-by: Marco Elver 
Cc: Alexander Potapenko 
Cc: Andrey Ryabinin 
Cc: Catalin Marinas 
Cc: Dmitry Vyukov 
Cc: Evgenii Stepanov 
Cc: Mark Rutland 
Cc: Peter Collingbourne 
Cc: Vincenzo Frascino 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kasan, page_alloc: allow skipping unpoisoning for HW_TAGS

2022-03-25T02:06:47+00:00

Add a new GFP flag __GFP_SKIP_KASAN_UNPOISON that allows skipping KASAN
unpoisoning for page_alloc allocations.  The flag is only effective with
HW_TAGS KASAN.

This flag will be used by vmalloc code for page_alloc allocations backing
vmalloc() mappings in a following patch.  The reason to skip KASAN
unpoisoning for these pages in page_alloc is because vmalloc code will be
unpoisoning them instead.

Also reword the comment for __GFP_SKIP_KASAN_POISON.

Link: https://lkml.kernel.org/r/35c97d77a704f6ff971dd3bfe4be95855744108e.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov 
Acked-by: Marco Elver 
Cc: Alexander Potapenko 
Cc: Andrey Ryabinin 
Cc: Catalin Marinas 
Cc: Dmitry Vyukov 
Cc: Evgenii Stepanov 
Cc: Mark Rutland 
Cc: Peter Collingbourne 
Cc: Vincenzo Frascino 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kasan, page_alloc: rework kasan_unpoison_pages call site

2022-03-25T02:06:46+00:00

Rework the checks around kasan_unpoison_pages() call in post_alloc_hook().

The logical condition for calling this function is:

 - If a software KASAN mode is enabled, we need to mark shadow memory.

 - Otherwise, HW_TAGS KASAN is enabled, and it only makes sense to set
   tags if they haven't already been cleared by tag_clear_highpage(),
   which is indicated by init_tags.
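
The two cases above can be written as a single predicate.  This is only a
sketch: enum kasan_mode_sim and should_unpoison_sim() are hypothetical
names, and the case of KASAN being disabled altogether is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical enumeration of the KASAN modes considered here. */
enum kasan_mode_sim {
	KASAN_SIM_GENERIC,  /* software mode, uses shadow memory */
	KASAN_SIM_SW_TAGS,  /* software mode, uses shadow memory */
	KASAN_SIM_HW_TAGS,  /* hardware tag-based mode */
};

/*
 * Return true if the unpoison step should run for a new allocation.
 * init_tags tells whether the tags were already cleared while the page
 * was being zeroed.
 */
static bool should_unpoison_sim(enum kasan_mode_sim mode, bool init_tags)
{
	assert(mode == KASAN_SIM_GENERIC || mode == KASAN_SIM_SW_TAGS ||
	       mode == KASAN_SIM_HW_TAGS);

	/* Software modes always need their shadow memory marked. */
	if (mode == KASAN_SIM_GENERIC || mode == KASAN_SIM_SW_TAGS)
		return true;

	/* HW_TAGS: set tags only if they were not cleared already. */
	return !init_tags;
}
```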

This patch concludes the changes for post_alloc_hook().

Link: https://lkml.kernel.org/r/0ecebd0d7ccd79150e3620ea4185a32d3dfe912f.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov 
Acked-by: Marco Elver 
Cc: Alexander Potapenko 
Cc: Andrey Ryabinin 
Cc: Catalin Marinas 
Cc: Dmitry Vyukov 
Cc: Evgenii Stepanov 
Cc: Mark Rutland 
Cc: Peter Collingbourne 
Cc: Vincenzo Frascino 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

kasan, page_alloc: move kernel_init_free_pages in post_alloc_hook

2022-03-25T02:06:46+00:00

Pull the kernel_init_free_pages() call in post_alloc_hook() out of the big
if clause for better code readability.  This also allows for more
simplifications in the following patch.

This patch makes no functional changes.

Link: https://lkml.kernel.org/r/a7a76456501eb37ddf9fca6529cee9555e59cdb1.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov 
Reviewed-by: Alexander Potapenko 
Acked-by: Marco Elver 
Cc: Andrey Ryabinin 
Cc: Catalin Marinas 
Cc: Dmitry Vyukov 
Cc: Evgenii Stepanov 
Cc: Mark Rutland 
Cc: Peter Collingbourne 
Cc: Vincenzo Frascino 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds