linux.git/mm/memory_hotplug.c, branch v4.13-rc2

mm/memory-hotplug: switch locking to a percpu rwsem

2017-07-10T23:32:33+00:00

Andrey reported a potential deadlock with the memory hotplug lock and
the cpu hotplug lock.

The reason is that memory hotplug takes the memory hotplug lock and then
calls stop_machine() which calls get_online_cpus().  That's the reverse
lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.c

The problem has been there forever.  The reason why this was never
reported is that the cpu hotplug locking had this homebrewn recursive
reader writer semaphore construct which due to the recursion evaded the
full lock dep coverage.  The memory hotplug code copied that construct
verbatim and therefor has similar issues.

Three steps to fix this:

1) Convert the memory hotplug locking to a per cpu rwsem so the
   potential issues get reported proper by lockdep.

2) Lock the online cpus in mem_hotplug_begin() before taking the memory
   hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
   code to avoid recursive locking.

3) The cpu hotpluck locking in #2 causes a recursive locking of the cpu
   hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
   by invoking lru_add_drain_all_cpuslocked() instead.

Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
Reported-by: Andrey Ryabinin 
Signed-off-by: Thomas Gleixner 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Vladimir Davydov 
Cc: Peter Zijlstra 
Cc: Davidlohr Bueso 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug.c: remove unused local zone_type from __remove_zone()

2017-07-10T23:32:32+00:00

__remove_zone() sets up up zone_type, but never uses it for anything.
This does not cause a warning, due to the (necessary) use of
-Wno-unused-but-set-variable.  However, it's noise, so just delete it.

Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: unify new_node_page and alloc_migrate_target

2017-07-10T23:32:31+00:00

Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") has duplicated a large part of
alloc_migrate_target with some hotplug specific special casing.

To be more precise it tried to enfore the allocation from a different
node than the original page.  As a result the two function diverged in
their shared logic, e.g.  the hugetlb allocation strategy.

Let's unify the two and express different NUMA requirements by the given
nodemask.  new_node_page will simply exclude the node it doesn't care
about and alloc_migrate_target will use all the available nodes.
alloc_migrate_target will then learn to migrate hugetlb pages more
sanely and use preallocated pool when possible.

Please note that alloc_migrate_target used to call alloc_page resp.
alloc_pages_current so the memory policy of the current context which is
quite strange when we consider that it is used in the context of
alloc_contig_range which just tries to migrate pages which stand in the
way.

Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Naoya Horiguchi 
Cc: Xishi Qiu 
Cc: zhong jiang 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlb, memory_hotplug: prefer to use reserved pages for migration

2017-07-10T23:32:31+00:00

new_node_page will try to use the origin's next NUMA node as the
migration destination for hugetlb pages.  If such a node doesn't have
any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
to allocate a surplus page instead.  This is quite subotpimal for any
configuration when hugetlb pages are no distributed to all NUMA nodes
evenly.  Say we have a hotplugable node 4 and spare hugetlb pages are
node 0

  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
  /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0

Now we consume the whole pool on node 4 and try to offline this node.
All the allocated pages should be moved to node0 which has enough
preallocated pages to hold them.  With the current implementation
offlining very likely fails because hugetlb allocations during runtime
are much less reliable.

Fix this by reusing the nodemask which excludes migration source and try
to find a first node which has a page in the preallocated pool first and
fall back to __alloc_buddy_huge_page_no_mpol only when the whole pool is
consumed.

[akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Naoya Horiguchi 
Cc: Xishi Qiu 
Cc: zhong jiang 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: simplify empty node mask handling in new_node_page

2017-07-10T23:32:31+00:00

new_node_page tries to allocate the target page on a different NUMA node
than the source page.  This makes sense in most cases during the hotplug
because we are likely to offline the whole numa node.  But there are
cases where there are no other nodes to fallback (e.g.  when offlining
parts of the only existing node) and we have to fallback to allocating
from the source node.  The current code does that but it can be
simplified by checking the nmask and updating it before we even try to
allocate rather than special casing it.

This patch shouldn't introduce any functional change.

Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Naoya Horiguchi 
Cc: Xishi Qiu 
Cc: zhong jiang 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: support movable_node for hotpluggable nodes

2017-07-10T23:32:31+00:00

movable_node kernel parameter allows making hotpluggable NUMA nodes to
put all the hotplugable memory into movable zone which allows more or
less reliable memory hotremove.  At least this is the case for the NUMA
nodes present during the boot (see find_zone_movable_pfns_for_nodes).

This is not the case for the memory hotplug, though.

	echo online > /sys/devices/system/memory/memoryXYZ/state

will default to a kernel zone (usually ZONE_NORMAL) unless the
particular memblock is already in the movable zone range which is not
the case normally when onlining the memory from the udev rule context
for a freshly hotadded NUMA node.  The only option currently is to have
a special udev rule to echo online_movable to all memblocks belonging to
such a node which is rather clumsy.  Not to mention this is inconsistent
as well because what ended up in the movable zone during the boot will
end up in a kernel zone after hotremove & hotadd without special care.

It would be nice to reuse memblock_is_hotpluggable but the runtime
hotplug doesn't have that information available because the boot and
hotplug paths are not shared and it would be really non trivial to make
them use the same code path because the runtime hotplug doesn't play
with the memblock allocator at all.

Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
movable_node is enabled and the range doesn't overlap with the existing
normal zone.  This should provide a reasonable default onlining
strategy.

Strictly speaking the semantic is not identical with the boot time
initialization because find_zone_movable_pfns_for_nodes covers only the
hotplugable range as described by the BIOS/FW.  From my experience this
is usually a full node though (except for Node0 which is special and
never goes away completely).  If this turns out to be a problem in the
real life we can tweak the code to store hotplug flag into memblocks but
let's keep this simple now.

Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Acked-by: Reza Arbab 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Yasuaki Ishimatsu 
Cc: 
Cc: Kani Toshimitsu 
Cc: 
Cc: Joonsoo Kim 
Cc: Andi Kleen 
Cc: David Rientjes 
Cc: Daniel Kiper 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference

2017-07-10T23:32:30+00:00

The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
might be NULL.

rollback_node_hotadd() dereferences this pointer.  Add NULL check to
avoid a potential NULL pointer dereference.

Addresses-Coverity-ID: 1369133
Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
Signed-off-by: Gustavo A. R. Silva 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: move movable_node to the hotplug proper

2017-07-06T23:24:35+00:00

movable_node_is_enabled is defined in memblock proper while it is
initialized from the memory hotplug proper.  This is quite messy and it
makes a dependency between the two so move movable_node along with the
helper functions to memory_hotplug.

To make it more entertaining the kernel parameter is ignored unless
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node
information for each memblock otherwise.  So let's warn when the option
is disabled.

Link: http://lkml.kernel.org/r/20170529114141.536-4-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Reza Arbab 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Jerome Glisse 
Cc: Yasuaki Ishimatsu 
Cc: Xishi Qiu 
Cc: Kani Toshimitsu 
Cc: Chen Yucong 
Cc: Joonsoo Kim 
Cc: Andi Kleen 
Cc: David Rientjes 
Cc: Daniel Kiper 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: drop CONFIG_MOVABLE_NODE

2017-07-06T23:24:35+00:00

Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for
movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
good explanation on why it is actually useful.

It makes a lot of sense to make movable node semantic opt in but we
already have that because the feature has to be explicitly enabled on
the kernel command line.  A config option on top only makes the
configuration space larger without a good reason.  It also adds an
additional ifdefery that pollutes the code.

Just drop the config option and make it de-facto always enabled.  This
shouldn't introduce any change to the semantic.

Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Reza Arbab 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Jerome Glisse 
Cc: Yasuaki Ishimatsu 
Cc: Xishi Qiu 
Cc: Kani Toshimitsu 
Cc: Chen Yucong 
Cc: Joonsoo Kim 
Cc: Andi Kleen 
Cc: David Rientjes 
Cc: Daniel Kiper 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memory_hotplug: drop artificial restriction on online/offline

2017-07-06T23:24:35+00:00

Patch series "remove CONFIG_MOVABLE_NODE".

I am continuing to clean up the memory hotplug code and
CONFIG_MOVABLE_NODE seems dubious at best.  The following two patches
simply removes the flag and make it de-facto always enabled.

The current semantic of the config option is twofold 1) it automatically
binds hotplugable nodes to have memory in zone_movable by default when
movable_node is enabled 2) forbids memory hotplug to online all the
memory as movable when !CONFIG_MOVABLE_NODE.

The later restriction is quite dubious because there is no clear cut of
how much normal memory do we need for a reasonable system operation.  A
single memory block which is sufficient to allow further movable onlines
is far from sufficient (e.g a node with >2GB and memblocks 128MB will
fill up this zone with struct pages leaving nothing for other
allocations).  Removing the config option will not only reduce the
configuration space it also removes quite some code.

The semantic of the movable_node command line parameter is preserved.

The first patch removes the restriction mentioned above and the second
one simply removes all the CONFIG_MOVABLE_NODE related stuff.  The last
patch moves movable_node flag handling to memory_hotplug proper where it
belongs.

[1] http://lkml.kernel.org/r/20170524122411.25212-1-mhocko@kernel.org

This patch (of 3):

Commit 74d42d8fe146 ("memory_hotplug: ensure every online node has
NORMAL memory") has introduced a restriction that every numa node has to
have at least some memory in !movable zones before a first movable
memory can be onlined if !CONFIG_MOVABLE_NODE.

Likewise can_offline_normal checks the amount of normal memory in
!movable zones and it disallows to offline memory if there is no normal
memory left with a justification that "memory-management acts bad when
we have nodes which is online but don't have any normal memory".

While it is true that not having _any_ memory for kernel allocations on
a NUMA node is far from great and such a node would be quite subotimal
because all kernel allocations will have to fallback to another NUMA
node but there is no reason to disallow such a configuration in
principle.

Besides that there is not really a big difference to have one memblock
for ZONE_NORMAL available or none.  With 128MB size memblocks the system
might trash on the kernel allocations requests anyway.  It is really
hard to draw a line on how much normal memory is really sufficient so we
have to rely on administrator to configure system sanely therefore drop
the artificial restriction and remove can_offline_normal and
can_online_high_movable altogether.

Link: http://lkml.kernel.org/r/20170529114141.536-2-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Acked-by: Reza Arbab 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Jerome Glisse 
Cc: Yasuaki Ishimatsu 
Cc: Xishi Qiu 
Cc: Kani Toshimitsu 
Cc: Chen Yucong 
Cc: Joonsoo Kim 
Cc: Andi Kleen 
Cc: David Rientjes 
Cc: Daniel Kiper 
Cc: Igor Mammedov 
Cc: Vitaly Kuznetsov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds