linux.git/mm/page_alloc.c, branch v5.7-rc2

mm/page_alloc: make pcpu_drain_mutex and pcpu_drain static

2020-04-10T22:36:21+00:00

Fix the following sparse warning:

  mm/page_alloc.c:106:1: warning: symbol 'pcpu_drain_mutex' was not declared. Should it be static?
  mm/page_alloc.c:107:1: warning: symbol '__pcpu_scope_pcpu_drain' was not declared. Should it be static?
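
As a rough illustration of what this class of warning is about (not the
kernel code itself), the sketch below gives a file-scope object internal
linkage with `static`.  The names drain_count and note_drain are
hypothetical stand-ins chosen for this example.

```c
#include <assert.h>

/*
 * File-scope object with internal linkage.  Without `static` the symbol
 * would be visible to every other translation unit, and a checker such
 * as sparse would ask for a declaration in a header.
 * (drain_count is a hypothetical stand-in, not a kernel symbol.)
 */
static int drain_count;

/* Public entry point: the only way other files can reach drain_count. */
int note_drain(void)
{
	return ++drain_count;
}
```

Callers in other files would only ever see note_drain(); the counter
itself stays private to the file that defines it.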

Reported-by: Hulk Robot 
Signed-off-by: Jason Yan 
Signed-off-by: Andrew Morton 
Link: http://lkml.kernel.org/r/20200407023925.46438-1-yanaijie@huawei.com
Signed-off-by: Linus Torvalds

mm/page_alloc.c: fix kernel-doc warning

2020-04-10T22:36:20+00:00

Add description of function parameter 'mt' to fix kernel-doc warning:

  mm/page_alloc.c:3246: warning: Function parameter or member 'mt' not described in '__putback_isolated_page'

Signed-off-by: Randy Dunlap 
Signed-off-by: Andrew Morton 
Acked-by: Pankaj Gupta 
Link: http://lkml.kernel.org/r/02998bd4-0b82-2f15-2570-f86130304d1e@infradead.org
Signed-off-by: Linus Torvalds

mm: introduce Reported pages

2020-04-07T17:43:38+00:00

In order to pave the way for free page reporting in virtualized
environments we will need a way to get pages out of the free lists and
identify those pages after they have been returned.  To accomplish this,
this patch adds the concept of a Reported Buddy, which is essentially
meant to just be the Uptodate flag used in conjunction with the Buddy page
type.

To prevent the reported pages from leaking outside of the buddy lists I
added a check to clear the PageReported bit in the del_page_from_free_list
function.  As a result any reported page that is split, merged, or
allocated will have the flag cleared prior to the PageBuddy value being
cleared.
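
The ordering described above can be sketched with a toy free list.  This
is a simplified model under stated assumptions, not the kernel
implementation: struct toy_page, struct toy_free_list and the toy_*
helpers are hypothetical, and plain booleans stand in for the page flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; field names are assumptions. */
struct toy_page {
	struct toy_page *prev, *next;
	bool buddy;	/* page currently sits in a buddy free list */
	bool reported;	/* page has already been reported */
};

struct toy_free_list {
	struct toy_page head;	/* sentinel node, never a real page */
	unsigned long nr_free;
};

void toy_free_list_init(struct toy_free_list *fl)
{
	fl->head.prev = &fl->head;
	fl->head.next = &fl->head;
	fl->nr_free = 0;
}

/* Insert at the head of the list and mark the page as a buddy page. */
void toy_add_to_free_list(struct toy_page *page, struct toy_free_list *fl)
{
	page->next = fl->head.next;
	page->prev = &fl->head;
	fl->head.next->prev = page;
	fl->head.next = page;
	page->buddy = true;
	fl->nr_free++;
}

/*
 * Remove a page from the list.  The reported state is dropped before
 * the buddy state, so it can never outlive the page's time in the list.
 */
void toy_del_page_from_free_list(struct toy_page *page,
				 struct toy_free_list *fl)
{
	assert(page->buddy);

	if (page->reported)
		page->reported = false;

	page->prev->next = page->next;
	page->next->prev = page->prev;
	page->prev = page->next = NULL;

	page->buddy = false;
	fl->nr_free--;
}
```

Because every split, merge, and allocation has to take the page off its
free list first, funnelling the clear through the one delete helper is
enough to cover all three cases in this model.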

The process for reporting pages is fairly simple.  Once we free a page
that meets the minimum order for page reporting we will schedule a worker
thread to start 2s or more in the future.  That worker thread will begin
working from the lowest supported page reporting order up to MAX_ORDER - 1
pulling unreported pages from the free list and storing them in the
scatterlist.

When processing each individual free list it is necessary for the worker
thread to release the zone lock when it needs to stop and report the full
scatterlist of pages.  To reduce the work of the next iteration the worker
thread will rotate the free list so that the first unreported page in the
free list becomes the first entry in the list.
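
The rotation step can be pictured with a small circular list.  The
sketch below is only an illustration: struct toy_node and the toy_list_*
helpers are hypothetical names, and the list is a minimal stand-in for
the kernel's list type.

```c
#include <assert.h>

/* Minimal circular doubly linked list; all names are hypothetical. */
struct toy_node {
	struct toy_node *prev, *next;
	int id;
};

void toy_list_init(struct toy_node *head)
{
	head->prev = head;
	head->next = head;
}

void toy_list_add_tail(struct toy_node *node, struct toy_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Rotate the list so that `entry` becomes the first element.  Only the
 * sentinel moves: it is unlinked and re-inserted just before `entry`,
 * so the relative order of the real entries is preserved.
 */
void toy_list_rotate_to_front(struct toy_node *entry, struct toy_node *head)
{
	assert(entry != head);

	/* Unlink the sentinel from its current position. */
	head->prev->next = head->next;
	head->next->prev = head->prev;

	/* Re-insert the sentinel immediately before `entry`. */
	head->prev = entry->prev;
	head->next = entry;
	entry->prev->next = head;
	entry->prev = head;
}
```

With entries 0..4 in the list, rotating to entry 3 leaves the order
3, 4, 0, 1, 2, so a later walk from the head starts at the first entry
that still needs attention.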

It will then call a reporting function providing information on how many
entries are in the scatterlist.  Once the function completes it will
return the pages to the free area from which they were allocated and start
over pulling more pages from the free areas until there are no longer
enough pages to report on to keep the worker busy, or we have processed as
many pages as were contained in the free area when we started processing
the list.

The worker thread will work in a round-robin fashion making its way through
each zone requesting reporting, and through each reportable free list
within that zone.  Once all free areas within the zone have been processed
it will check to see if there have been any requests for reporting while
it was processing.  If so it will reschedule the worker thread to start up
again in roughly 2s and exit.

Signed-off-by: Alexander Duyck 
Signed-off-by: Andrew Morton 
Acked-by: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Konrad Rzeszutek Wilk 
Cc: Luiz Capitulino 
Cc: Matthew Wilcox 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Nitesh Narayan Lal 
Cc: Oscar Salvador 
Cc: Pankaj Gupta 
Cc: Paolo Bonzini 
Cc: Rik van Riel 
Cc: Vlastimil Babka 
Cc: Wei Wang 
Cc: Yang Zhang 
Cc: wei qi 
Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds

mm: add function __putback_isolated_page

2020-04-07T17:43:38+00:00

There are cases where we would benefit from avoiding having to go through
the allocation and free cycle to return an isolated page.

Examples for this might include page poisoning in which we isolate a page
and then put it back in the free list without ever having actually
allocated it.

This will enable us to also avoid notifiers for the future free page
reporting which will need to avoid retriggering page reporting when
returning pages that have been reported on.

Signed-off-by: Alexander Duyck 
Signed-off-by: Andrew Morton 
Acked-by: David Hildenbrand 
Acked-by: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Cc: Dave Hansen 
Cc: Konrad Rzeszutek Wilk 
Cc: Luiz Capitulino 
Cc: Matthew Wilcox 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Nitesh Narayan Lal 
Cc: Oscar Salvador 
Cc: Pankaj Gupta 
Cc: Paolo Bonzini 
Cc: Rik van Riel 
Cc: Vlastimil Babka 
Cc: Wei Wang 
Cc: Yang Zhang 
Cc: wei qi 
Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds

mm: use zone and order instead of free area in free_list manipulators

2020-04-07T17:43:38+00:00

In order to enable the use of the zone from the list manipulator functions
I will need access to the zone pointer.  As it turns out most of the
accessors were always just being directly passed &zone->free_area[order]
anyway so it would make sense to just fold that into the function itself
and pass the zone and order as arguments instead of the free area.

In order to be able to reference the zone we need to move the declaration
of the functions down so that we have the zone defined before we define
the list manipulation functions.  Since the functions are only used in the
file mm/page_alloc.c we can just move them there to reduce noise in the
header.
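
The shape of the interface change can be sketched as follows.  This is a
toy model, not the real helpers: struct toy_zone, struct toy_free_area,
TOY_MAX_ORDER and both functions are hypothetical, and the list handling
is reduced to a counter.

```c
#include <assert.h>

#define TOY_MAX_ORDER 11	/* assumed value for this sketch only */

struct toy_free_area {
	unsigned long nr_free;
};

struct toy_zone {
	struct toy_free_area free_area[TOY_MAX_ORDER];
};

/* Old style: the caller looks up the free area and passes it in. */
void toy_add_to_free_area(struct toy_free_area *area)
{
	area->nr_free++;
}

/*
 * New style: the caller passes zone and order, and the lookup of
 * &zone->free_area[order] is folded into the helper.  The helper now
 * also has the zone pointer available should it need it.
 */
void toy_add_to_free_list(struct toy_zone *zone, unsigned int order)
{
	struct toy_free_area *area;

	assert(order < TOY_MAX_ORDER);
	area = &zone->free_area[order];
	area->nr_free++;
}
```

A caller that used to write toy_add_to_free_area(&zone->free_area[order])
would simply write toy_add_to_free_list(zone, order) instead.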

Signed-off-by: Alexander Duyck 
Signed-off-by: Andrew Morton 
Reviewed-by: Dan Williams 
Reviewed-by: David Hildenbrand 
Reviewed-by: Pankaj Gupta 
Acked-by: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dave Hansen 
Cc: Konrad Rzeszutek Wilk 
Cc: Luiz Capitulino 
Cc: Matthew Wilcox 
Cc: Michael S. Tsirkin 
Cc: Michal Hocko 
Cc: Nitesh Narayan Lal 
Cc: Oscar Salvador 
Cc: Paolo Bonzini 
Cc: Rik van Riel 
Cc: Vlastimil Babka 
Cc: Wei Wang 
Cc: Yang Zhang 
Cc: wei qi 
Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds

mm: adjust shuffle code to allow for future coalescing

2020-04-07T17:43:38+00:00

Patch series "mm / virtio: Provide support for free page reporting", v17.

This series provides an asynchronous means of reporting free guest pages
to a hypervisor so that the memory associated with those pages can be
dropped and reused by other processes and/or guests on the host.  Using
this it is possible to avoid unnecessary I/O to disk and greatly improve
performance in the case of memory overcommit on the host.

When enabled we will be performing a scan of free memory every 2 seconds
while pages of sufficiently high order are being freed.  In each pass at
least one sixteenth of each free list will be reported.  By doing this we
avoid racing against other threads that may be causing a high amount of
memory churn.

The lowest page order currently scanned when reporting pages is
pageblock_order so that this feature will not interfere with the use of
Transparent Huge Pages in the case of virtualization.

Currently this is only in use by virtio-balloon; however, there is the hope
that at some point in the future other hypervisors might be able to make
use of it.  In the virtio-balloon/QEMU implementation the hypervisor is
currently using MADV_DONTNEED to indicate to the host kernel that the page
is currently free.  It will be zeroed and faulted back into the guest the
next time the page is accessed.
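
The zero-on-next-access behaviour can be observed from plain userspace
on Linux.  The sketch below is not the QEMU code; it only shows
madvise(MADV_DONTNEED) on a private anonymous mapping, and the function
name dontneed_zeroes_page is made up for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Dirty one private anonymous page, drop it with MADV_DONTNEED, then
 * read it back.  Returns 1 if the page reads as zero afterwards, 0 if
 * it does not, and -1 if the setup fails.
 */
int dontneed_zeroes_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t len;
	int ok;

	if (page_size <= 0)
		return -1;
	len = (size_t)page_size;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAA, len);		/* make the page resident and dirty */

	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* The next access faults in a fresh zero-filled page. */
	ok = (p[0] == 0 && p[len - 1] == 0);

	munmap(p, len);
	return ok;
}
```

The read after madvise() is where the fault and the zeroing happen, which
is the cost referred to in the benchmark discussion further down.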

To track if a page is reported or not the Uptodate flag was repurposed and
used as a Reported flag for Buddy pages.  We walk through the free list
isolating pages and adding them to the scatterlist until we either
encounter the end of the list or have processed at least one sixteenth of
the pages that were listed in nr_free prior to us starting.  If we fill
the scatterlist before we reach the end of the list we rotate the list so
that the first unreported page we encounter is moved to the head of the
list as that is where we will resume after we have freed the reported
pages back into the tail of the list.
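
The one-sixteenth bound is simple arithmetic.  The helper below is a
sketch only: the name toy_report_budget is hypothetical, and rounding up
is an assumption made here so that "at least one sixteenth" also holds
when nr_free is not a multiple of 16.

```c
#include <assert.h>

/*
 * Smallest number of pages that is at least one sixteenth of the
 * nr_free value sampled before the walk started (rounded up).
 */
unsigned long toy_report_budget(unsigned long nr_free)
{
	return nr_free / 16 + (nr_free % 16 != 0);
}
```

For a list that held 1000 pages when the pass began, this gives a bound
of 63 pages, since 1000 / 16 = 62.5.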

Below are the results from various benchmarks.  I primarily focused on two
tests.  The first is the will-it-scale/page_fault2 test, and the other is
a modified version of will-it-scale/page_fault1 that was enabled to use
THP.  I did this as it allows for better visibility into different parts
of the memory subsystem.  The guest is running with 32G of RAM on one
node of an E5-2630 v3.  The host has had some features such as CPU turbo
disabled in the BIOS.

Test                   page_fault1 (THP)    page_fault2
Name            tasks  Process Iter  STDEV  Process Iter  STDEV
Baseline            1    1012402.50  0.14%     361855.25  0.81%
                   16    8827457.25  0.09%    3282347.00  0.34%

Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
                   16    8784741.75  0.39%    3240669.25  0.48%

Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
                   16    8756219.00  0.24%    3226608.75  0.97%

Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
 page shuffle      16    8672601.25  0.49%    3223177.75  0.40%

Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
 shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%

The results above are for a baseline with a linux-next-20191219 kernel,
that kernel with this patch set applied but page reporting disabled in
virtio-balloon, the patches applied and page reporting fully enabled, the
patches enabled with page shuffling enabled, and the patches applied with
page shuffling enabled and an RFC patch that makes use of MADV_FREE in
QEMU.  These results include the deviation seen between the average value
reported here versus the high and/or low value.  I observed that during
the test memory usage for the first three tests never dropped whereas with
the patches fully enabled the VM would drop to using only a few GB of the
host's memory when switching from memhog to page fault tests.

Any of the overhead visible with this patch set enabled seems due to page
faults caused by accessing the reported pages and the host zeroing the
page before giving it back to the guest.  This overhead is much more
visible when using THP than with standard 4K pages.  In addition page
shuffling seemed to increase the number of faults generated due to an
increase in memory churn.  The overhead is reduced when using MADV_FREE as
we can avoid the extra zeroing of the pages when they are reintroduced to
the host, as can be seen when the RFC is applied with shuffling enabled.

The overall guest size is kept fairly small to only a few GB while the
test is running.  If the host memory were oversubscribed this patch set
should result in a performance improvement as swapping memory in the host
can be avoided.

A brief history on the background of free page reporting can be found at:
https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/

This patch (of 9):

Move the head/tail adding logic out of the shuffle code and into the
__free_one_page function since ultimately that is where it is really
needed anyway.  By doing this we should be able to reduce the overhead and
can consolidate all of the list addition bits in one spot.

Signed-off-by: Alexander Duyck 
Signed-off-by: Andrew Morton 
Reviewed-by: Dan Williams 
Acked-by: Mel Gorman 
Acked-by: David Hildenbrand 
Cc: Yang Zhang 
Cc: Pankaj Gupta 
Cc: Konrad Rzeszutek Wilk 
Cc: Nitesh Narayan Lal 
Cc: Rik van Riel 
Cc: Matthew Wilcox 
Cc: Luiz Capitulino 
Cc: Dave Hansen 
Cc: Wei Wang 
Cc: Andrea Arcangeli 
Cc: Paolo Bonzini 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Michael S. Tsirkin 
Cc: wei qi 
Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds

mm,thp,compaction,cma: allow THP migration for CMA allocations

2020-04-02T16:35:31+00:00

The code to implement THP migrations already exists, and the code for CMA
to clear out a region of memory already exists.

Only a few small tweaks are needed to allow CMA to move THP memory when
attempting an allocation from alloc_contig_range.

With these changes, migrating THPs from a CMA area works when allocating a
1GB hugepage from CMA memory.

[riel@surriel.com: fix hugetlbfs pages per Mike, cleanup per Vlastimil]
  Link: http://lkml.kernel.org/r/20200228104700.0af2f18d@imladris.surriel.com
Signed-off-by: Rik van Riel 
Signed-off-by: Andrew Morton 
Reviewed-by: Zi Yan 
Reviewed-by: Vlastimil Babka 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
Cc: David Rientjes 
Cc: Andrea Arcangeli 
Cc: Mike Kravetz 
Cc: Joonsoo Kim 
Link: http://lkml.kernel.org/r/20200227213238.1298752-2-riel@surriel.com
Signed-off-by: Linus Torvalds

mm,compaction,cma: add alloc_contig flag to compact_control

2020-04-02T16:35:31+00:00

Patch series "fix THP migration for CMA allocations", v2.

Transparent huge pages are allocated with __GFP_MOVABLE, and can end up in
CMA memory blocks.  Transparent huge pages also have most of the
infrastructure in place to allow migration.

However, a few pieces were missing, causing THP migration to fail when
attempting to use CMA to allocate 1GB hugepages.

With these patches in place, THP migration from CMA blocks seems to work,
both for anonymous THPs and for tmpfs/shmem THPs.

This patch (of 2):

Add information to struct compact_control to indicate that the allocator
would really like to clear out this specific part of memory, as used by,
for example, CMA.

Signed-off-by: Rik van Riel 
Signed-off-by: Andrew Morton 
Reviewed-by: Vlastimil Babka 
Cc: Andrea Arcangeli 
Cc: David Rientjes 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Zi Yan 
Cc: Joonsoo Kim 
Link: http://lkml.kernel.org/r/20200227213238.1298752-1-riel@surriel.com
Signed-off-by: Linus Torvalds

mm/page_alloc: simplify page_is_buddy() for better code readability

2020-04-02T16:35:31+00:00

Simplify page_is_buddy() to reduce the redundant code for better code
readability.

Signed-off-by: chenqiwu 
Signed-off-by: Andrew Morton 
Reviewed-by: Alexander Duyck 
Reviewed-by: Matthew Wilcox (Oracle) 
Reviewed-by: Vlastimil Babka 
Acked-by: Pankaj Gupta 
Link: http://lkml.kernel.org/r/1583853751-5525-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds

mm/page_alloc.c: micro-optimisation Remove unnecessary branch

2020-04-02T16:35:31+00:00

Previously, if the branch condition was false, the assignment was not
executed.  The assignment can safely be executed even when the condition
is false: it assigns the value of 'nodemask' to 'ac.nodemask', which
already has the same value.

So as the assignment can be executed unconditionally, the branch can be
removed.
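
The equivalence can be checked with a small model.  This is a sketch
under assumptions, not the actual hunk: struct toy_alloc_context and both
helpers are hypothetical, and the branch is assumed to compare the two
pointers, which is what the description above implies.

```c
#include <assert.h>

/* Toy stand-in for the allocation context; only the field of interest. */
struct toy_alloc_context {
	const unsigned long *nodemask;
};

/* Before: store only when the values differ (assumed branch shape). */
void toy_set_nodemask_branch(struct toy_alloc_context *ac,
			     const unsigned long *nodemask)
{
	if (ac->nodemask != nodemask)
		ac->nodemask = nodemask;
}

/* After: store unconditionally; rewriting an equal value changes nothing. */
void toy_set_nodemask_plain(struct toy_alloc_context *ac,
			    const unsigned long *nodemask)
{
	ac->nodemask = nodemask;
}
```

Both helpers leave ac->nodemask equal to nodemask in every case, so the
only difference is the compare and branch that the second one drops.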

Signed-off-by: Mateusz Nosek 
Signed-off-by: Andrew Morton 
Reviewed-by: Matthew Wilcox (Oracle) 
Link: http://lkml.kernel.org/r/20200307225335.31300-1-mateusznosek0@gmail.com
Signed-off-by: Linus Torvalds