linux.git/mm/internal.h, branch v5.9

mm: replace hpage_nr_pages with thp_nr_pages

2020-08-15T02:56:56+00:00

The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Andrew Morton 
Reviewed-by: William Kucharski 
Reviewed-by: Zi Yan 
Cc: Mike Kravetz 
Cc: David Hildenbrand 
Cc: "Kirill A. Shutemov" 
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds

mm: add thp_size

2020-08-15T02:56:56+00:00

This function returns the number of bytes in a THP.  It is like
page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
is disabled.

Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Andrew Morton 
Reviewed-by: William Kucharski 
Reviewed-by: Zi Yan 
Cc: David Hildenbrand 
Cc: Mike Kravetz 
Cc: "Kirill A. Shutemov" 
Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
Signed-off-by: Linus Torvalds

mm/mempolicy: use a standard migration target allocation callback

2020-08-12T17:58:02+00:00

There is a well-defined migration target allocation callback.  Use it.

Signed-off-by: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Christoph Hellwig 
Cc: Mike Kravetz 
Cc: Naoya Horiguchi 
Cc: Roman Gushchin 
Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds

mm/migrate: introduce a standard migration target allocation function

2020-08-12T17:58:02+00:00

There are some similar functions for migration target allocation.  Since
there is no fundamental difference, it's better to keep just one rather
than keeping all variants.  This patch implements base migration target
allocation function.  In the following patches, variants will be converted
to use this function.

Changes should be mechanical, but, unfortunately, there are some
differences.  First, some callers' nodemask is assgined to NULL since NULL
nodemask will be considered as all available nodes, that is,
&node_states[N_MEMORY].  Second, for hugetlb page allocation, gfp_mask is
redefined as regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
user provided gfp_mask has it.  This is because future caller of this
function requires to set this node constaint.  Lastly, if provided nodeid
is NUMA_NO_NODE, nodeid is set up to the node where migration source
lives.  It helps to remove simple wrappers for setting up the nodeid.

Note that PageHighmem() call in previous function is changed to open-code
"is_highmem_idx()" since it provides more readability.

[akpm@linux-foundation.org: tweak patch title, per Vlastimil]
[akpm@linux-foundation.org: fix typo in comment]

Signed-off-by: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
Cc: Christoph Hellwig 
Cc: Mike Kravetz 
Cc: Naoya Horiguchi 
Cc: Roman Gushchin 
Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds

mm: proactive compaction

2020-08-12T17:57:56+00:00

For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented.  Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency.  Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system.  Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl.  Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate.  The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation.  A zone's present_pages determines its weight.

To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same.  If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value.  By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.

This patch is largely based on ideas from Michal Hocko [2].  See also the
LWN article [3].

Performance data
================

System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap.  The workload is mainly anonymous
userspace pages, which are easy to move around.  I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

  percentile latency
  –––––––––– –––––––
	   5    7894
	  10    9496
	  25   12561
	  30   15295
	  40   18244
	  50   21229
	  60   27556
	  75   30147
	  80   31047
	  90   32859
	  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

  percentile latency
  –––––––––– –––––––
	   5       2
	  10       2
	  25       3
	  30       3
	  40       3
	  50       4
	  60       4
	  75       4
	  80       4
	  90       5
	  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

2. JAVA heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages.  We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15, as proactiveness is further increased.

Note that proactive compaction happens throughout the runtime of these
workloads.  The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20.  The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction.  As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark.  kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

Above workloads produce a memory state which is easy to compact.  However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off.  To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta 
Signed-off-by: Andrew Morton 
Tested-by: Oleksandr Natalenko 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Khalid Aziz 
Reviewed-by: Oleksandr Natalenko 
Cc: Vlastimil Babka 
Cc: Khalid Aziz 
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Matthew Wilcox 
Cc: Mike Kravetz 
Cc: Joonsoo Kim 
Cc: David Rientjes 
Cc: Nitin Gupta 
Cc: Oleksandr Natalenko 
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds

mmap locking API: convert mmap_sem comments

2020-06-09T16:39:14+00:00

Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse 
Signed-off-by: Andrew Morton 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Daniel Jordan 
Cc: Davidlohr Bueso 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Jason Gunthorpe 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Laurent Dufour 
Cc: Liam Howlett 
Cc: Matthew Wilcox 
Cc: Peter Zijlstra 
Cc: Ying Han 
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

2020-06-09T16:39:14+00:00

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse 
Signed-off-by: Andrew Morton 
Reviewed-by: Daniel Jordan 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
Cc: Davidlohr Bueso 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Jason Gunthorpe 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Liam Howlett 
Cc: Matthew Wilcox 
Cc: Peter Zijlstra 
Cc: Ying Han 
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds

mm: fix a typo in comment "strucure"->"structure"

2020-06-05T02:06:24+00:00

There is a typo in comment, fix it.

Signed-off-by: Ethon Paul 
Signed-off-by: Andrew Morton 
Reviewed-by: Ralph Campbell 
Link: http://lkml.kernel.org/r/20200411064723.15855-1-ethp@qq.com
Signed-off-by: Linus Torvalds

mm/vmscan.c: change prototype for shrink_page_list

2020-06-04T03:09:47+00:00

commit 3c710c1ad11b ("mm, vmscan extract shrink_page_list reclaim counters
into a struct") changed data type for the function, so changing return
type for funciton and its caller.

Signed-off-by: Vaneet Narang 
Signed-off-by: Maninder Singh 
Signed-off-by: Andrew Morton 
Acked-by: Michal Hocko 
Cc: Amit Sahrawat 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com
Signed-off-by: Linus Torvalds

mm/page_alloc: integrate classzone_idx and high_zoneidx

2020-06-04T03:09:44+00:00

classzone_idx is just different name for high_zoneidx now.  So, integrate
them and add some comment to struct alloc_context in order to reduce
future confusion about the meaning of this variable.

The accessor, ac_classzone_idx() is also removed since it isn't needed
after integration.

In addition to integration, this patch also renames high_zoneidx to
highest_zoneidx since it represents more precise meaning.

Signed-off-by: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Reviewed-by: Baoquan He 
Acked-by: Vlastimil Babka 
Acked-by: David Rientjes 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Ye Xiaolong 
Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds