linux.git/mm/memory_hotplug.c, branch v4.15

memory hotplug: fix comments when adding section

2017-11-16T02:21:07+00:00

Here, pfn_to_node should be page_to_nid.

Link: http://lkml.kernel.org/r/1510735205-22540-1-git-send-email-fan.du@intel.com
Signed-off-by: Fan Du 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: remove timeout from __offline_memory

2017-11-16T02:21:02+00:00

We have a hardcoded 120s timeout after which the memory offline fails
basically since the hot remove has been introduced.  This is essentially
a policy implemented in the kernel.  Moreover there is no way to adjust
the timeout and so we are sometimes facing memory offline failures if
the system is under a heavy memory pressure or very intensive CPU
workload on large machines.

It is not very clear what purpose the timeout actually serves.  The
offline operation is interruptible by a signal so if userspace wants
some timeout based termination this can be done trivially by sending a
signal.

If there is a strong usecase to do this from the kernel then we should
do it properly and have a it tunable from the userspace with the timeout
disabled by default along with the explanation who uses it and for what
purporse.

Link: http://lkml.kernel.org/r/20170918070834.13083-3-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: KAMEZAWA Hiroyuki 
Cc: Reza Arbab 
Cc: Yasuaki Ishimatsu 
Cc: Xishi Qiu 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Cc: Michael Ellerman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: do not fail offlining too early

2017-11-16T02:21:02+00:00

Patch series "mm, memory_hotplug: redefine memory offline retry logic", v2.

While testing memory hotplug on a large 4TB machine we have noticed that
memory offlining is just too eager to fail.  The primary reason is that
the retry logic is just too easy to give up.  We have 4 ways out of the
offline

	- we have a permanent failure (isolation or memory notifiers fail,
	  or hugetlb pages cannot be dropped)
	- userspace sends a signal
	- a hardcoded 120s timeout expires
	- page migration fails 5 times

This is way too convoluted and it doesn't scale very well.  We have seen
both temporary migration failures as well as 120s being triggered.
After removing those restrictions we were able to pass stress testing
during memory hot remove without any other negative side effects
observed.  Therefore I suggest dropping both hard coded policies.  I
couldn't have found any specific reason for them in the changelog.  I
neither didn't get any response [1] from Kamezawa.  If we need some
upper bound - e.g.  timeout based - then we should have a proper and
user defined policy for that.  In any case there should be a clear use
case when introducing it.

This patch (of 2):

Memory offlining can fail too eagerly under heavy memory pressure.

  page:ffffea22a646bd00 count:255 mapcount:252 mapping:ffff88ff926c9f38 index:0x3
  flags: 0x9855fe40010048(uptodate|active|mappedtodisk)
  page dumped because: isolation failed
  page->mem_cgroup:ffff8801cd662000
  memory offlining [mem 0x18b580000000-0x18b5ffffffff] failed

Isolation has failed here because the page is not on LRU.  Most probably
because it was on the pcp LRU cache or it has been removed from the LRU
already but it hasn't been freed yet.  In both cases the page doesn't
look non-migrable so retrying more makes sense.

__offline_pages seems rather cluttered when it comes to the retry logic.
We have 5 retries at maximum and a timeout.  We could argue whether the
timeout makes sense but failing just because of a race when somebody
isoltes a page from LRU or puts it on a pcp LRU lists is just wrong.  It
only takes it to race with a process which unmaps some pages and remove
them from the LRU list and we can fail the whole offline because of
something that is a temporary condition and actually not harmful for the
offline.

Please note that unmovable pages should be already excluded during
start_isolate_page_range.  We could argue that has_unmovable_pages is
racy and MIGRATE_MOVABLE check doesn't provide any hard guarantee either
but kernel zones (aka < ZONE_MOVABLE) will very likely detect unmovable
pages in most cases and movable zone shouldn't contain unmovable pages
at all.  Some of those pages might be pinned but not for ever because
that would be a bug on its own.  In any case the context is still
interruptible and so the userspace can easily bail out when the
operation takes too long.  This is certainly better behavior than a
hardcoded retry loop which is racy.

Fix this by removing the max retry count and only rely on the timeout
resp. interruption by a signal from the userspace.  Also retry rather
than fail when check_pages_isolated sees some !free pages because those
could be a result of the race as well.

Link: http://lkml.kernel.org/r/20170918070834.13083-2-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: KAMEZAWA Hiroyuki 
Cc: Reza Arbab 
Cc: Yasuaki Ishimatsu 
Cc: Xishi Qiu 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Cc: Michael Ellerman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: define find_{smallest|biggest}_section_pfn as unsigned long

2017-10-04T00:54:26+00:00

find_{smallest|biggest}_section_pfn()s find the smallest/biggest section
and return the pfn of the section.  But the functions are defined as int.
So the functions always return 0x00000000 - 0xffffffff.  It means if
memory address is over 16TB, the functions does not work correctly.

To handle 64 bit value, the patch defines
find_{smallest|biggest}_section_pfn() as unsigned long.

Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
Link: http://lkml.kernel.org/r/d9d5593a-d0a4-c4be-ab08-493df59a85c6@gmail.com
Signed-off-by: Yasuaki Ishimatsu 
Acked-by: Michal Hocko 
Cc: Xishi Qiu 
Cc: Reza Arbab 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function

2017-10-04T00:54:25+00:00

pfn_to_section_nr() and section_nr_to_pfn() are defined as macro.
pfn_to_section_nr() has no issue even if it is defined as macro.  But
section_nr_to_pfn() has overflow issue if sec is defined as int.

section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT.  If sec is
defined as unsigned long, section_nr_to_pfn() returns pfn as 64 bit value.
But if sec is defined as int, section_nr_to_pfn() returns pfn as 32 bit
value.

__remove_section() calculates start_pfn using section_nr_to_pfn() and
scn_nr defined as int.  So if hot-removed memory address is over 16TB,
overflow issue occurs and section_nr_to_pfn() does not calculate correct
pfn.

To make callers use proper arg, the patch changes the macros to inline
functions.

Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
Signed-off-by: Yasuaki Ishimatsu 
Acked-by: Michal Hocko 
Cc: Xishi Qiu 
Cc: Reza Arbab 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: add scheduling point to __add_pages

2017-10-04T00:54:25+00:00

Patch series "mm, memory_hotplug: fix few soft lockups in memory
hotadd".

Johannes has noticed few soft lockups when adding a large nvdimm device.
All of them were caused by a long loop without any explicit cond_resched
which is a problem for !PREEMPT kernels.

The fix is quite straightforward.  Just make sure that cond_resched gets
called from time to time.

This patch (of 3):

__add_pages gets a pfn range to add and there is no upper bound for a
single call.  This is usually a memory block aligned size for the
regular memory hotplug - smaller sizes are usual for memory balloning
drivers, or the whole NUMA node for physical memory online.  There is no
explicit scheduling point in that code path though.

This can lead to long latencies while __add_pages is executed and we
have even seen a soft lockup report during nvdimm initialization with
!PREEMPT kernel

  NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [kworker/u641:3:832]
  [...]
  Workqueue: events_unbound async_run_entry_fn
  task: ffff881809270f40 ti: ffff881809274000 task.ti: ffff881809274000
  RIP: _raw_spin_unlock_irqrestore+0x11/0x20
  RSP: 0018:ffff881809277b10  EFLAGS: 00000286
  [...]
  Call Trace:
    sparse_add_one_section+0x13d/0x18e
    __add_pages+0x10a/0x1d0
    arch_add_memory+0x4a/0xc0
    devm_memremap_pages+0x29d/0x430
    pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
    nvdimm_bus_probe+0x64/0x110 [libnvdimm]
    driver_probe_device+0x1f7/0x420
    bus_for_each_drv+0x52/0x80
    __device_attach+0xb0/0x130
    bus_probe_device+0x87/0xa0
    device_add+0x3fc/0x5f0
    nd_async_device_register+0xe/0x40 [libnvdimm]
    async_run_entry_fn+0x43/0x150
    process_one_work+0x14e/0x410
    worker_thread+0x116/0x490
    kthread+0xc7/0xe0
    ret_from_fork+0x3f/0x70
  DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

Fix this by adding cond_resched once per each memory section in the
given pfn range.  Each section is constant amount of work which itself
is not too expensive but many of them will just add up.

Link: http://lkml.kernel.org/r/20170918121410.24466-2-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reported-by: Johannes Thumshirn 
Tested-by: Johannes Thumshirn 
Cc: Dan Williams 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory

2017-09-09T01:26:46+00:00

HMM (heterogeneous memory management) need struct page to support
migration from system main memory to device memory.  Reasons for HMM and
migration to device memory is explained with HMM core patch.

This patch deals with device memory that is un-addressable memory (ie CPU
can not access it).  Hence we do not want those struct page to be manage
like regular memory.  That is why we extend ZONE_DEVICE to support
different types of memory.

A persistent memory type is define for existing user of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory
type.  There is a clear separation between what is expected from each
memory type and existing user of ZONE_DEVICE are un-affected by new
requirement and new use of the un-addressable type.  All specific code
path are protect with test against the memory type.

Because memory is un-addressable we use a new special swap type for when a
page is migrated to device memory (this reduces the number of maximum swap
file).

The main two additions beside memory type to ZONE_DEVICE is two callbacks.
First one, page_free() is call whenever page refcount reach 1 (which
means the page is free as ZONE_DEVICE page never reach a refcount of 0).
This allow device driver to manage its memory and associated struct page.

The second callback page_fault() happens when there is a CPU access to an
address that is back by a device page (which are un-addressable by the
CPU).  This callback is responsible to migrate the page back to system
main memory.  Device driver can not block migration back to system memory,
HMM make sure that such page can not be pin into device memory.

If device is in some error condition and can not migrate memory back then
a CPU page fault to device memory should end with SIGBUS.

[arnd@arndb.de: fix warning]
  Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse 
Signed-off-by: Arnd Bergmann 
Acked-by: Dan Williams 
Cc: Ross Zwisler 
Cc: Aneesh Kumar 
Cc: Balbir Singh 
Cc: Benjamin Herrenschmidt 
Cc: David Nellans 
Cc: Evgeny Baskakov 
Cc: Johannes Weiner 
Cc: John Hubbard 
Cc: Kirill A. Shutemov 
Cc: Mark Hairgrove 
Cc: Michal Hocko 
Cc: Paul E. McKenney 
Cc: Sherry Cheung 
Cc: Subhash Gutti 
Cc: Vladimir Davydov 
Cc: Bob Liu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memory_hotplug: memory hotremove supports thp migration

2017-09-09T01:26:45+00:00

This patch enables thp migration for memory hotremove.

Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
Signed-off-by: Naoya Horiguchi 
Signed-off-by: Zi Yan 
Cc: "H. Peter Anvin" 
Cc: Anshuman Khandual 
Cc: Dave Hansen 
Cc: David Nellans 
Cc: Ingo Molnar 
Cc: Kirill A. Shutemov 
Cc: Mel Gorman 
Cc: Minchan Kim 
Cc: Thomas Gleixner 
Cc: Vlastimil Babka 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: get rid of zonelists_mutex

2017-09-07T00:27:26+00:00

zonelists_mutex was introduced by commit 4eaf3f64397c ("mem-hotplug: fix
potential race while building zonelist for new populated zone") to
protect zonelist building from races.  This is no longer needed though
because both memory online and offline are fully serialized.  New users
have grown since then.

Notably setup_per_zone_wmarks wants to prevent from races between memory
hotplug, khugepaged setup and manual min_free_kbytes update via sysctl
(see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes").  Let's
add a private lock for that purpose.  This will not prevent from seeing
halfway through memory hotplug operation but that shouldn't be a big
deal becuse memory hotplug will update watermarks explicitly so we will
eventually get a full picture.  The lock just makes sure we won't race
when updating watermarks leading to weird results.

Also __build_all_zonelists manipulates global data so add a private lock
for it as well.  This doesn't seem to be necessary today but it is more
robust to have a lock there.

While we are at it make sure we document that memory online/offline
depends on a full serialization either via mem_hotplug_begin() or
device_lock.

Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Mel Gorman 
Cc: Shaohua Li 
Cc: Toshi Kani 
Cc: Vlastimil Babka 
Cc: Haicheng Li 
Cc: Wu Fengguang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node

2017-09-07T00:27:26+00:00

try_online_node calls hotadd_new_pgdat which already calls
build_all_zonelists.  So the additional call is redundant.  Even though
hotadd_new_pgdat will only initialize zonelists of the new node this is
the right thing to do because such a node doesn't have any memory so
other zonelists would ignore all the zones from this node anyway.

Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Toshi Kani 
Cc: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Mel Gorman 
Cc: Shaohua Li 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds