linux.git/mm/memory_hotplug.c, branch v5.0-rc2

mm, memory_hotplug: deobfuscate migration part of offlining

2018-12-28T20:11:50+00:00

Memory migration might fail during offlining and we keep retrying in that
case.  This is currently obfuscated by goto retry loop.  The code is hard
to follow and as a result it is even suboptimal becase each retry round
scans the full range from start_pfn even though we have successfully
scanned/migrated [start_pfn, pfn] range already.  This is all only because
check_pages_isolated failure has to rescan the full range again.

De-obfuscate the migration retry loop by promoting it to a real for loop.
In fact remove the goto altogether by making it a proper double loop
(yeah, gotos are nasty in this specific case).  In the end we will get a
slightly more optimal code which is better readable.

[akpm@linux-foundation.org: reflow comments to 80 cols]
Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Reviewed-by: Oscar Salvador 
Cc: Hugh Dickins 
Cc: Jan Kara 
Cc: Kirill A. Shutemov 
Cc: Pavel Tatashin 
Cc: William Kucharski 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: try to migrate full pfn range

2018-12-28T20:11:50+00:00

Patch series "few memory offlining enhancements".

I have been chasing memory offlining not making progress recently.  On the
way I have noticed few weird decisions in the code.  The migration itself
is restricted without a reasonable justification and the retry loop around
the migration is quite messy.  This is addressed by patch 1 and patch 2.

Patch 3 is targeting on the faultaround code which has been a hot
candidate for the initial issue reported upstream [2] and that I am
debugging internally.  It turned out to be not the main contributor in the
end but I believe we should address it regardless.  See the patch
description for more details.

[1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv

This patch (of 3):

do_migrate_range has been limiting the number of pages to migrate to 256
for some reason which is not documented.  Even if the limit made some
sense back then when it was introduced it doesn't really serve a good
purpose these days.  If the range contains huge pages then we break out of
the loop too early and go through LRU and pcp caches draining and
scan_movable_pages is quite suboptimal.

The only reason to limit the number of pages I can think of is to reduce
the potential time to react on the fatal signal.  But even then the number
of pages is a questionable metric because even a single page migration
might block in a non-killable state (e.g.  __unmap_and_move).

Remove the limit and offline the full requested range (this is one
memblock worth of pages with the current code).  Should we ever get a
report that offlining takes too long to react on fatal signal then we
should rather fix the core migration to use killable waits and bailout
on a signal.

Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Reviewed-by: Pavel Tatashin 
Reviewed-by: Oscar Salvador 
Cc: Hugh Dickins 
Cc: Jan Kara 
Cc: Kirill A. Shutemov 
Cc: William Kucharski 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined

2018-12-28T20:11:50+00:00

We have received a bug report that an injected MCE about faulty memory
prevents memory offline to succeed on 4.4 base kernel.  The underlying
reason was that the HWPoison page has an elevated reference count and the
migration keeps failing.  There are two problems with that.  First of all
it is dubious to migrate the poisoned page because we know that accessing
that memory is possible to fail.  Secondly it doesn't make any sense to
migrate a potentially broken content and preserve the memory corruption
over to a new location.

Oscar has found out that 4.4 and the current upstream kernels behave
slightly differently with his simply testcase

===

int main(void)
{
        int ret;
        int i;
        int fd;
        char *array = malloc(4096);
        char *array_locked = malloc(4096);

        fd = open("/tmp/data", O_RDONLY);
        read(fd, array, 4095);

        for (i = 0; i < 4096; i++)
                array_locked[i] = 'd';

        ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
        if (ret)
                perror("mlock");

        sleep (20);

        ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
        if (ret)
                perror("madvise");

        for (i = 0; i < 4096; i++)
                array_locked[i] = 'd';

        return 0;
}
===

+ offline this memory.

In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU
list
kernel:  [] dump_trace+0x59/0x340
kernel:  [] show_stack_log_lvl+0xea/0x170
kernel:  [] show_stack+0x21/0x40
kernel:  [] dump_stack+0x5c/0x7c
kernel:  [] warn_slowpath_common+0x81/0xb0
kernel:  [] __pagevec_lru_add_fn+0x14c/0x160
kernel:  [] pagevec_lru_move_fn+0xad/0x100
kernel:  [] __lru_cache_add+0x6c/0xb0
kernel:  [] add_to_page_cache_lru+0x46/0x70
kernel:  [] extent_readpages+0xc3/0x1a0 [btrfs]
kernel:  [] __do_page_cache_readahead+0x177/0x200
kernel:  [] ondemand_readahead+0x168/0x2a0
kernel:  [] generic_file_read_iter+0x41f/0x660
kernel:  [] __vfs_read+0xcd/0x140
kernel:  [] vfs_read+0x7a/0x120
kernel:  [] kernel_read+0x3b/0x50
kernel:  [] do_execveat_common.isra.29+0x490/0x6f0
kernel:  [] do_execve+0x28/0x30
kernel:  [] call_usermodehelper_exec_async+0xfb/0x130
kernel:  [] ret_from_fork+0x55/0x80

And that latter confuses the hotremove path because an LRU page is
attempted to be migrated and that fails due to an elevated reference
count.  It is quite possible that the reuse of the HWPoisoned page is some
kind of fixed race condition but I am not really sure about that.

With the upstream kernel the failure is slightly different.  The page
doesn't seem to have LRU bit set but isolate_movable_page simply fails and
do_migrate_range simply puts all the isolated pages back to LRU and
therefore no progress is made and scan_movable_pages finds same set of
pages over and over again.

Fix both cases by explicitly checking HWPoisoned pages before we even try
to get reference on the page, try to unmap it if it is still mapped.  As
explained by Naoya:

: Hwpoison code never unmapped those for no big reason because
: Ksm pages never dominate memory, so we simply didn't have strong
: motivation to save the pages.

Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
HWPoison pages which shouldn't happen but I couldn't convince myself about
that.  Naoya has noted the following:

: Theoretically no such gurantee, because try_to_unmap() doesn't have a
: guarantee of success and then memory_failure() returns immediately
: when hwpoison_user_mappings fails.
: Or the following code (comes after hwpoison_user_mappings block) also impli=
: es
: that the target page can still have PageLRU flag.
:
:         /*
:          * Torn down by someone else?
:          */
:         if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) {
:                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
:                 res =3D -EBUSY;
:                 goto out;
:         }
:
: So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
: current version of your patch.

Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Debugged-by: Oscar Salvador 
Tested-by: Oscar Salvador 
Acked-by: David Hildenbrand 
Acked-by: Naoya Horiguchi 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection

2018-12-28T20:11:49+00:00

During online_pages phase, pgdat->nr_zones will be updated in case this
zone is empty.

Currently the online_pages phase is protected by the global locks
(device_device_hotplug_lock and mem_hotplug_lock), which ensures there is
no contention during the update of nr_zones.

These global locks introduces scalability issues (especially the second
one), which slow down code relying on get_online_mems().  This is also a
preparation for not having to rely on get_online_mems() but instead some
more fine grained locks.

The patch moves init_currently_empty_zone under both zone_span_writelock
and pgdat_resize_lock because both the pgdat state is changed (nr_zones)
and the zone's start_pfn.  Also this patch changes the documentation of
node_size_lock to include the protection of nr_zones.

Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Cc: David Hildenbrand 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, sparse: pass nid instead of pgdat to sparse_add_one_section()

2018-12-28T20:11:49+00:00

Since the information needed in sparse_add_one_section() is node id to
allocate proper memory, it is not necessary to pass its pgdat.

This patch changes the prototype of sparse_add_one_section() to pass node
id directly.  This is intended to reduce misleading that
sparse_add_one_section() would touch pgdat.

Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Cc: Dave Hansen 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: add nid parameter to arch_remove_memory

2018-12-28T20:11:49+00:00

Patch series "Do not touch pages in hot-remove path", v2.

This patchset aims for two things:

 1) A better definition about offline and hot-remove stage
 2) Solving bugs where we can access non-initialized pages
    during hot-remove operations [2] [3].

This is achieved by moving all page/zone handling to the offline
stage, so we do not need to access pages when hot-removing memory.

[1] https://patchwork.kernel.org/cover/10691415/
[2] https://patchwork.kernel.org/patch/10547445/
[3] https://www.spinics.net/lists/linux-mm/msg161316.html

This patch (of 5):

This is a preparation for the following-up patches.  The idea of passing
the nid is that it will allow us to get rid of the zone parameter
afterwards.

Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
Reviewed-by: Pavel Tatashin 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jerome Glisse 
Cc: Jonathan Cameron 
Cc: "Rafael J. Wysocki" 

Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug: drop "online" parameter from add_memory_resource()

2018-12-28T20:11:48+00:00

Userspace should always be in charge of how to online memory and if memory
should be onlined automatically in the kernel.  Let's drop the parameter
to overwrite this - XEN passes memhp_auto_online, just like add_memory(),
so we can directly use that instead internally.

Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Michal Hocko 
Reviewed-by: Oscar Salvador 
Acked-by: Juergen Gross 
Cc: Boris Ostrovsky 
Cc: Stefano Stabellini 
Cc: Dan Williams 
Cc: Pavel Tatashin 
Cc: David Hildenbrand 
Cc: Joonsoo Kim 
Cc: Arun KS 
Cc: Mathieu Malaterre 
Cc: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: do not clear numa_node association after hot_remove

2018-12-28T20:11:47+00:00

Per-cpu numa_node provides a default node for each possible cpu.  The
association gets initialized during the boot when the architecture
specific code explores cpu->NUMA affinity.  When the whole NUMA node is
removed though we are clearing this association

try_offline_node
  check_and_unmap_cpu_on_node
    unmap_cpu_on_node
      numa_clear_node
        numa_set_node(cpu, NUMA_NO_NODE)

This means that whoever calls cpu_to_node for a cpu associated with such a
node will get NUMA_NO_NODE.  This is problematic for two reasons.  First
it is fragile because __alloc_pages_node would simply blow up on an
out-of-bound access.  We have encountered this when loading kvm module

  BUG: unable to handle kernel paging request at 00000000000021c0
  IP: __alloc_pages_nodemask+0x93/0xb70
  PGD 800000ffe853e067 PUD 7336bbc067 PMD 0
  Oops: 0000 [#1] SMP
  [...]
  CPU: 88 PID: 1223749 Comm: modprobe Tainted: G        W          4.4.156-94.64-default #1
  RIP: __alloc_pages_nodemask+0x93/0xb70
  RSP: 0018:ffff887354493b40  EFLAGS: 00010202
  RAX: 00000000000021c0 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00000000014000c0
  RBP: 00000000014000c0 R08: ffffffffffffffff R09: 0000000000000000
  R10: ffff88fffc89e790 R11: 0000000000014000 R12: 0000000000000101
  R13: ffffffffa0772cd4 R14: ffffffffa0769ac0 R15: 0000000000000000
  FS:  00007fdf2f2f1700(0000) GS:ffff88fffc880000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000021c0 CR3: 00000077205ee000 CR4: 0000000000360670
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    alloc_vmcs_cpu+0x3d/0x90 [kvm_intel]
    hardware_setup+0x781/0x849 [kvm_intel]
    kvm_arch_hardware_setup+0x28/0x190 [kvm]
    kvm_init+0x7c/0x2d0 [kvm]
    vmx_init+0x1e/0x32c [kvm_intel]
    do_one_initcall+0xca/0x1f0
    do_init_module+0x5a/0x1d7
    load_module+0x1393/0x1c90
    SYSC_finit_module+0x70/0xa0
    entry_SYSCALL_64_fastpath+0x1e/0xb7
  DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb7

on an older kernel but the code is basically the same in the current Linus
tree as well.  alloc_vmcs_cpu could use alloc_pages_nodemask which would
recognize NUMA_NO_NODE and use alloc_pages_node which would translate it
to numa_mem_id but that is wrong as well because it would use a cpu
affinity of the local CPU which might be quite far from the original node.
It is also reasonable to expect that cpu_to_node will provide a sane
value and there might be many more callers like that.

The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined.  We do not
want to lose that link as there is no arch independent way to get it from
the early boot time AFAICS.

Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues.  The NODE_DATA(nid) is not deallocated so
it will stay in place and if anybody wants to allocate from that node then
a fallback node will be used.

Thanks to Vlastimil Babka for his live system debugging skills that helped
debugging the issue.

Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org
Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node")
Signed-off-by: Michal Hocko 
Debugged-by: Vlastimil Babka 
Reported-by: Miroslav Benes 
Acked-by: Anshuman Khandual 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: only report isolation failures when offlining memory

2018-12-28T20:11:46+00:00

Heiko has complained that his log is swamped by warnings from
has_unmovable_pages

[   20.536664] page dumped because: has_unmovable_pages
[   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
[   20.536794] flags: 0x3fffe0000010200(slab|head)
[   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
[   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
[   20.536797] page dumped because: has_unmovable_pages
[   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
[   20.536815] flags: 0x7fffe0000000000()
[   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
[   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000

which are not triggered by the memory hotplug but rather CMA allocator.
The original idea behind dumping the page state for all call paths was
that these messages will be helpful debugging failures.  From the above it
seems that this is not the case for the CMA path because we are lacking
much more context.  E.g the second reported page might be a CMA allocated
page.  It is still interesting to see a slab page in the CMA area but it
is hard to tell whether this is bug from the above output alone.

Address this issue by dumping the page state only on request.  Both
start_isolate_page_range and has_unmovable_pages already have an argument
to ignore hwpoison pages so make this argument more generic and turn it
into flags and allow callers to combine non-default modes into a mask.
While we are at it, has_unmovable_pages call from
is_pageblock_removable_nolock (sysfs removable file) is questionable to
report the failure so drop it from there as well.

Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reported-by: Heiko Carstens 
Reviewed-by: Oscar Salvador 
Cc: Anshuman Khandual 
Cc: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: be more verbose for memory offline failures

2018-12-28T20:11:46+00:00

There is only very limited information printed when the memory offlining
fails:

[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff

This tells us that the failure is triggered by the userspace intervention
but it doesn't tell us much more about the underlying reason.  It might be
that the page migration failes repeatedly and the userspace timeout
expires and send a signal or it might be some of the earlier steps
(isolation, memory notifier) takes too long.

If the migration failes then it would be really helpful to see which page
that and its state.  The same applies to the isolation phase.  If we fail
to isolate a page from the allocator then knowing the state of the page
would be helpful as well.

Dump the page state that fails to get isolated or migrated.  This will
tell us more about the failure and what to focus on during debugging.

[akpm@linux-foundation.org: add missing printk arg]
[mhocko@suse.com: tweak dump_page() `reason' text]
  Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reviewed-by: Andrew Morton 
Reviewed-by: Oscar Salvador 
Reviewed-by: Anshuman Khandual 
Cc: Baoquan He 
Cc: Oscar Salvador 
Cc: William Kucharski 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds