linux-stable.git/mm/memory_hotplug.c, branch v4.14.78

mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long

2017-10-04T00:54:26+00:00

find_{smallest|biggest}_section_pfn()s find the smallest/biggest section
and return the pfn of the section.  But the functions are defined as int.
So the functions always return 0x00000000 - 0xffffffff.  It means if
memory address is over 16TB, the functions does not work correctly.

To handle 64 bit value, the patch defines
find_{smallest|biggest}_section_pfn() as unsigned long.

Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
Link: http://lkml.kernel.org/r/d9d5593a-d0a4-c4be-ab08-493df59a85c6@gmail.com
Signed-off-by: Yasuaki Ishimatsu 
Acked-by: Michal Hocko 
Cc: Xishi Qiu 
Cc: Reza Arbab 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function

2017-10-04T00:54:25+00:00

pfn_to_section_nr() and section_nr_to_pfn() are defined as macro.
pfn_to_section_nr() has no issue even if it is defined as macro.  But
section_nr_to_pfn() has overflow issue if sec is defined as int.

section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT.  If sec is
defined as unsigned long, section_nr_to_pfn() returns pfn as 64 bit value.
But if sec is defined as int, section_nr_to_pfn() returns pfn as 32 bit
value.

__remove_section() calculates start_pfn using section_nr_to_pfn() and
scn_nr defined as int.  So if hot-removed memory address is over 16TB,
overflow issue occurs and section_nr_to_pfn() does not calculate correct
pfn.

To make callers use proper arg, the patch changes the macros to inline
functions.

Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
Signed-off-by: Yasuaki Ishimatsu 
Acked-by: Michal Hocko 
Cc: Xishi Qiu 
Cc: Reza Arbab 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: add scheduling point to __add_pages

2017-10-04T00:54:25+00:00

Patch series "mm, memory_hotplug: fix few soft lockups in memory
hotadd".

Johannes has noticed few soft lockups when adding a large nvdimm device.
All of them were caused by a long loop without any explicit cond_resched
which is a problem for !PREEMPT kernels.

The fix is quite straightforward.  Just make sure that cond_resched gets
called from time to time.

This patch (of 3):

__add_pages gets a pfn range to add and there is no upper bound for a
single call.  This is usually a memory block aligned size for the
regular memory hotplug - smaller sizes are usual for memory balloning
drivers, or the whole NUMA node for physical memory online.  There is no
explicit scheduling point in that code path though.

This can lead to long latencies while __add_pages is executed and we
have even seen a soft lockup report during nvdimm initialization with
!PREEMPT kernel

  NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [kworker/u641:3:832]
  [...]
  Workqueue: events_unbound async_run_entry_fn
  task: ffff881809270f40 ti: ffff881809274000 task.ti: ffff881809274000
  RIP: _raw_spin_unlock_irqrestore+0x11/0x20
  RSP: 0018:ffff881809277b10  EFLAGS: 00000286
  [...]
  Call Trace:
    sparse_add_one_section+0x13d/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    devm_memremap_pages+0x29d/0x430
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70
  DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

Fix this by adding cond_resched once per each memory section in the
given pfn range.  Each section is constant amount of work which itself
is not too expensive but many of them will just add up.

Link: http://lkml.kernel.org/r/20170918121410.24466-2-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reported-by: Johannes Thumshirn 
Tested-by: Johannes Thumshirn 
Cc: Dan Williams 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory

2017-09-09T01:26:46+00:00

HMM (heterogeneous memory management) need struct page to support
migration from system main memory to device memory.  Reasons for HMM and
migration to device memory is explained with HMM core patch.

This patch deals with device memory that is un-addressable memory (ie CPU
can not access it).  Hence we do not want those struct page to be manage
like regular memory.  That is why we extend ZONE_DEVICE to support
different types of memory.

A persistent memory type is define for existing user of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory
type.  There is a clear separation between what is expected from each
memory type and existing user of ZONE_DEVICE are un-affected by new
requirement and new use of the un-addressable type.  All specific code
path are protect with test against the memory type.

Because memory is un-addressable we use a new special swap type for when a
page is migrated to device memory (this reduces the number of maximum swap
file).

The main two additions beside memory type to ZONE_DEVICE is two callbacks.
First one, page_free() is call whenever page refcount reach 1 (which
means the page is free as ZONE_DEVICE page never reach a refcount of 0).
This allow device driver to manage its memory and associated struct page.

The second callback page_fault() happens when there is a CPU access to an
address that is back by a device page (which are un-addressable by the
CPU).  This callback is responsible to migrate the page back to system
main memory.  Device driver can not block migration back to system memory,
HMM make sure that such page can not be pin into device memory.

If device is in some error condition and can not migrate memory back then
a CPU page fault to device memory should end with SIGBUS.

[arnd@arndb.de: fix warning]
  Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse 
Signed-off-by: Arnd Bergmann 
Acked-by: Dan Williams 
Cc: Ross Zwisler 
Cc: Aneesh Kumar 
Cc: Balbir Singh 
Cc: Benjamin Herrenschmidt 
Cc: David Nellans 
Cc: Evgeny Baskakov 
Cc: Johannes Weiner 
Cc: John Hubbard 
Cc: Kirill A. Shutemov 
Cc: Mark Hairgrove 
Cc: Michal Hocko 
Cc: Paul E. McKenney 
Cc: Sherry Cheung 
Cc: Subhash Gutti 
Cc: Vladimir Davydov 
Cc: Bob Liu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memory_hotplug: memory hotremove supports thp migration

2017-09-09T01:26:45+00:00

This patch enables thp migration for memory hotremove.

Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
Signed-off-by: Naoya Horiguchi 
Signed-off-by: Zi Yan 
Cc: "H. Peter Anvin" 
Cc: Anshuman Khandual 
Cc: Dave Hansen 
Cc: David Nellans 
Cc: Ingo Molnar 
Cc: Kirill A. Shutemov 
Cc: Mel Gorman 
Cc: Minchan Kim 
Cc: Thomas Gleixner 
Cc: Vlastimil Babka 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: get rid of zonelists_mutex

2017-09-07T00:27:26+00:00

zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
potential race while building zonelist for new populated zone") to
protect zonelist building from races.  This is no longer needed though
because both memory online and offline are fully serialized.  New users
have grown since then.

Notably setup_per_zone_wmarks wants to prevent from races between memory
hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
(see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes").  Let's
add a private lock for that purpose.  This will not prevent from seeing
halfway through memory hotplug operation but that shouldn't be a big
deal becuse memory hotplug will update watermarks explicitly so we will
eventually get a full picture.  The lock just makes sure we won't race
when updating watermarks leading to weird results.

Also __build_all_zonelists manipulates global data so add a private lock
for it as well.  This doesn't seem to be necessary today but it is more
robust to have a lock there.

While we are at it make sure we document that memory online/offline
depends on a full serialization either via mem_hotplug_begin() or
device_lock.

Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Mel Gorman 
Cc: Shaohua Li 
Cc: Toshi Kani 
Cc: Vlastimil Babka 
Cc: Haicheng Li 
Cc: Wu Fengguang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node

2017-09-07T00:27:26+00:00

try_online_node calls hotadd_new_pgdat which already calls
build_all_zonelists.  So the additional call is redundant.  Even though
hotadd_new_pgdat will only initialize zonelists of the new node this is
the right thing to do because such a node doesn't have any memory so
other zonelists would ignore all the zones from this node anyway.

Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Toshi Kani 
Cc: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Mel Gorman 
Cc: Shaohua Li 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: drop zone from build_all_zonelists

2017-09-07T00:27:25+00:00

build_all_zonelists gets a zone parameter to initialize zone's pagesets.
There is only a single user which gives a non-NULL zone parameter and
that one doesn't really need the rest of the build_all_zonelists (see
commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
onlining pages")).

Therefore remove setup_zone_pageset from build_all_zonelists and call it
from its only user directly.  This will also remove a pointless zonlists
rebuilding which is always good.

Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Mel Gorman 
Cc: Shaohua Li 
Cc: Toshi Kani 
Cc: Wen Congyang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: remove zone restrictions

2017-09-07T00:27:25+00:00

Historically we have enforced that any kernel zone (e.g ZONE_NORMAL) has
to precede the Movable zone in the physical memory range.  The purpose
of the movable zone is, however, not bound to any physical memory
restriction.  It merely defines a class of migrateable and reclaimable
memory.

There are users (e.g.  CMA) who might want to reserve specific physical
memory ranges for their own purpose.  Moreover our pfn walkers have to
be prepared for zones overlapping in the physical range already because
we do support interleaving NUMA nodes and therefore zones can interleave
as well.  This means we can allow each memory block to be associated
with a different zone.

Loosen the current onlining semantic and allow explicit onlining type on
any memblock.  That means that online_{kernel,movable} will be allowed
regardless of the physical address of the memblock as long as it is
offline of course.  This might result in moveble zone overlapping with
other kernel zones.  Default onlining then becomes a bit tricky but
still sensible.  echo online > memoryXY/state will online the given
block to

	1) the default zone if the given range is outside of any zone
	2) the enclosing zone if such a zone doesn't interleave with
	   any other zone
        3) the default zone if more zones interleave for this range

where default zone is movable zone only if movable_node is enabled
otherwise it is a kernel zone.

Here is an example of the semantic with (movable_node is not present but
it work in an analogous way). We start with following memblocks, all of
them offline:

  memory34/valid_zones:Normal Movable
  memory35/valid_zones:Normal Movable
  memory36/valid_zones:Normal Movable
  memory37/valid_zones:Normal Movable
  memory38/valid_zones:Normal Movable
  memory39/valid_zones:Normal Movable
  memory40/valid_zones:Normal Movable
  memory41/valid_zones:Normal Movable

Now, we online block 34 in default mode and block 37 as movable

  root@test1:/sys/devices/system/node/node1# echo online > memory34/state
  root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
  memory34/valid_zones:Normal
  memory35/valid_zones:Normal Movable
  memory36/valid_zones:Normal Movable
  memory37/valid_zones:Movable
  memory38/valid_zones:Normal Movable
  memory39/valid_zones:Normal Movable
  memory40/valid_zones:Normal Movable
  memory41/valid_zones:Normal Movable

As we can see all other blocks can still be onlined both into Normal and
Movable zones and the Normal is default because the Movable zone spans
only block37 now.

  root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
  memory34/valid_zones:Normal
  memory35/valid_zones:Normal Movable
  memory36/valid_zones:Normal Movable
  memory37/valid_zones:Movable
  memory38/valid_zones:Movable Normal
  memory39/valid_zones:Movable Normal
  memory40/valid_zones:Movable Normal
  memory41/valid_zones:Movable

Now the default zone for blocks 37-41 has changed because movable zone
spans that range.

  root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
  memory34/valid_zones:Normal
  memory35/valid_zones:Normal Movable
  memory36/valid_zones:Normal Movable
  memory37/valid_zones:Movable
  memory38/valid_zones:Normal Movable
  memory39/valid_zones:Normal
  memory40/valid_zones:Movable Normal
  memory41/valid_zones:Movable

Note that the block 39 now belongs to the zone Normal and so block38
falls into Normal by default as well.

For completness

  root@test1:/sys/devices/system/node/node1# for i in memory[34]?
  do
	echo online > $i/state 2>/dev/null
  done

  memory34/valid_zones:Normal
  memory35/valid_zones:Normal
  memory36/valid_zones:Normal
  memory37/valid_zones:Movable
  memory38/valid_zones:Normal
  memory39/valid_zones:Normal
  memory40/valid_zones:Movable
  memory41/valid_zones:Movable

Implementation wise the change is quite straightforward.  We can get rid
of allow_online_pfn_range altogether.  online_pages allows only offline
nodes already.  The original default_zone_for_pfn will become
default_kernel_zone_for_pfn.  New default_zone_for_pfn implements the
above semantic.  zone_for_pfn_range is slightly reorganized to implement
kernel and movable online type explicitly and MMOP_ONLINE_KEEP becomes a
catch all default behavior.

Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Acked-by: Reza Arbab 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Yasuaki Ishimatsu 
Cc: Xishi Qiu 
Cc: Kani Toshimitsu 
Cc: 
Cc: Daniel Kiper 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Cc: Wei Yang 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: display allowed zones in the preferred ordering

2017-09-07T00:27:25+00:00

Prior to commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") we used to allow to change the
valid zone types of a memory block if it is adjacent to a different zone
type.

This fact was reflected in memoryNN/valid_zones by the ordering of
printed zones.  The first one was default (echo online > memoryNN/state)
and the other one could be onlined explicitly by online_{movable,kernel}.

This behavior was removed by the said patch and as such the ordering was
not all that important.  In most cases a kernel zone would be default
anyway.  The only exception is movable_node handled by "mm,
memory_hotplug: support movable_node for hotpluggable nodes".

Let's reintroduce this behavior again because later patch will remove
the zone overlap restriction and so user will be allowed to online
kernel resp.  movable block regardless of its placement.  Original
behavior will then become significant again because it would be
non-trivial for users to see what is the default zone to online into.

Implementation is really simple.  Pull out zone selection out of
move_pfn_range into zone_for_pfn_range helper and use it in
show_valid_zones to display the zone for default onlining and then both
kernel and movable if they are allowed.  Default online zone is not
duplicated.

Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Reza Arbab 
Cc: Yasuaki Ishimatsu 
Cc: Xishi Qiu 
Cc: Kani Toshimitsu 
Cc: 
Cc: Daniel Kiper 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Cc: Wei Yang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds