linux-stable.git/mm/memory_hotplug.c, branch v5.10.78

mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()

2021-09-22T10:27:59+00:00

commit 7cf209ba8a86410939a24cb1aeb279479a7e0ca6 upstream.

Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"

These are all cleanups and one fix previously sent as part of [1]:
[PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
groups.

These patches make sense even without the other series, therefore I pulled
them out to make the other series easier to digest.

[1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com

This patch (of 4):

Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.

As we will search for a fitting zone using the wrong pfn, we might end
up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy properties of
these zones.

Use "unsigned long" instead, just as we do in other places when handling
PFNs.  This can bite us once we have physical addresses in the range of
multiple TB.
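
The truncation described above can be reproduced outside the kernel.  The
following is a minimal sketch, not the kernel code: the helper names and
DEMO_PAGE_SHIFT are made up for illustration, 4 KiB pages are assumed, and
uint64_t stands in for a 64-bit "unsigned long".

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Buggy shape: the PFN is narrowed to "unsigned int". */
static unsigned int demo_pfn_narrow(uint64_t phys)
{
	/* This sketch assumes a 32-bit unsigned int. */
	assert(sizeof(unsigned int) == 4);
	return (unsigned int)(phys >> DEMO_PAGE_SHIFT);
}

/* Fixed shape: keep the full width of the PFN. */
static uint64_t demo_pfn_wide(uint64_t phys)
{
	return phys >> DEMO_PAGE_SHIFT;
}
```

For a physical address of 16 TiB + 0x5000, demo_pfn_wide() yields
2^32 + 5 while demo_pfn_narrow() yields 5, which looks like a page near the
start of memory and would therefore be matched against the wrong zone.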

Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
Signed-off-by: David Hildenbrand 
Reviewed-by: Pankaj Gupta 
Reviewed-by: Muchun Song 
Reviewed-by: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Vitaly Kuznetsov 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Pankaj Gupta 
Cc: Wei Yang 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Cc: Dave Hansen 
Cc: Vlastimil Babka 
Cc: Mike Rapoport 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Pavel Tatashin 
Cc: Heiko Carstens 
Cc: Michael Ellerman 
Cc: Catalin Marinas 
Cc: virtualization@lists.linux-foundation.org
Cc: Andy Lutomirski 
Cc: "Aneesh Kumar K.V" 
Cc: Anton Blanchard 
Cc: Ard Biesheuvel 
Cc: Baoquan He 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Christian Borntraeger 
Cc: Christophe Leroy 
Cc: Dave Jiang 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jia He 
Cc: Joe Perches 
Cc: Kefeng Wang 
Cc: Laurent Dufour 
Cc: Michel Lespinasse 
Cc: Nathan Lynch 
Cc: Nicholas Piggin 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Pierre Morel 
Cc: "Rafael J. Wysocki" 
Cc: Rich Felker 
Cc: Scott Cheloha 
Cc: Sergei Trofimovich 
Cc: Thiago Jung Bauermann 
Cc: Thomas Gleixner 
Cc: Vasily Gorbik 
Cc: Vishal Verma 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: David Hildenbrand 
Signed-off-by: Greg Kroah-Hartman

arm64: mte: Map hotplugged memory as Normal Tagged

2021-03-17T16:06:28+00:00

commit d15dfd31384ba3cb93150e5f87661a76fa419f74 upstream.

In a system supporting MTE, the linear map must allow reading/writing
allocation tags by setting the memory type as Normal Tagged. Currently,
this is only handled for memory present at boot. Hotplugged memory uses
Normal non-Tagged memory.

Introduce pgprot_mhp() for hotplugged memory and use it in
add_memory_resource(). The arm64 code maps pgprot_mhp() to
pgprot_tagged().

Note that ZONE_DEVICE memory should not be mapped as Tagged and
therefore setting the memory type in arch_add_memory() is not feasible.

Signed-off-by: Catalin Marinas 
Fixes: 0178dc761368 ("arm64: mte: Use Normal Tagged attributes for the linear map")
Reported-by: Patrick Daly 
Tested-by: Patrick Daly 
Link: https://lore.kernel.org/r/1614745263-27827-1-git-send-email-pdaly@codeaurora.org
Cc:  # 5.10.x
Cc: Will Deacon 
Cc: Andrew Morton 
Cc: Vincenzo Frascino 
Cc: David Hildenbrand 
Reviewed-by: David Hildenbrand 
Reviewed-by: Vincenzo Frascino 
Reviewed-by: Anshuman Khandual 
Link: https://lore.kernel.org/r/20210309122601.5543-1-catalin.marinas@arm.com
Signed-off-by: Will Deacon 
Signed-off-by: Greg Kroah-Hartman

mm: memmap defer init doesn't work as expected

2021-01-06T13:56:50+00:00

commit dc2da7b45ffe954a0090f5d0310ed7b0b37d2bd2 upstream.

VMware observed a performance regression during memmap init on their
platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
iterate over memblock regions rather that check each PFN") causing it.

Before the commit:

  [0.033176] Normal zone: 1445888 pages used for memmap
  [0.033176] Normal zone: 89391104 pages, LIFO batch:63
  [0.035851] ACPI: PM-Timer IO Port: 0x448

With the commit:

  [0.026874] Normal zone: 1445888 pages used for memmap
  [0.026875] Normal zone: 89391104 pages, LIFO batch:63
  [2.028450] ACPI: PM-Timer IO Port: 0x448

The root cause is that the current memmap defer init doesn't work as
expected.

Before, memmap_init_zone() was used to do memmap init of one whole zone:
all low zones of a NUMA node were initialized, while memmap init of the
last zone in that node was deferred.  However, since commit 73a6e474cb376,
memmap_init() iterates over the memblock regions inside one zone and calls
memmap_init_zone() to do memmap init for each region.

E.g., on VMware's system, the memory layout is as below; there are two
memory regions in node 2.  The current code mistakenly initializes the
whole 1st region [mem 0xab00000000-0xfcffffffff], then defers memmap init
after initializing only one memory section of the 2nd region [mem
0x10000000000-0x1033fffffff].  In fact, we expect the memmap of only one
memory section to be initialized in total.  That is why memmap init takes
more time.

[    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
[    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
[    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
[    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge
whether deferral should be applied zone-wide.
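
The effect of passing the region end versus the zone end can be modelled in
user space.  This is a simplified sketch under stated assumptions, not the
kernel implementation: all demo_* names are hypothetical, the section size
is assumed to be 32768 pages, and only the decision "is this range below
the node end?" plus the one-section threshold are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGES_PER_SECTION 32768UL	/* assumption: 128 MiB sections, 4 KiB pages */

struct demo_region { unsigned long start_pfn, end_pfn; };	/* [start, end) */
struct demo_defer_state { unsigned long prev_end_pfn, nr_initialised; };

/* Simplified model of the deferral decision for one PFN. */
static bool demo_defer_init(struct demo_defer_state *st, unsigned long pfn,
			    unsigned long end_pfn, unsigned long node_end_pfn)
{
	if (st->prev_end_pfn != end_pfn) {
		st->prev_end_pfn = end_pfn;
		st->nr_initialised = 0;
	}
	/* Ranges ending below the node end are treated as low zones: never defer. */
	if (end_pfn < node_end_pfn)
		return false;
	st->nr_initialised++;
	return st->nr_initialised > DEMO_PAGES_PER_SECTION &&
	       (pfn & (DEMO_PAGES_PER_SECTION - 1)) == 0;
}

/* Count pages initialised eagerly when walking the regions of the last zone. */
static unsigned long demo_eager_pages(const struct demo_region *r, size_t n,
				      unsigned long zone_end_pfn,
				      unsigned long node_end_pfn,
				      bool pass_zone_end)
{
	struct demo_defer_state st = { 0, 0 };
	unsigned long count = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		/* Old behaviour passes the region end, the fix passes the zone end. */
		unsigned long end = pass_zone_end ? zone_end_pfn : r[i].end_pfn;

		for (unsigned long pfn = r[i].start_pfn; pfn < r[i].end_pfn; pfn++) {
			if (demo_defer_init(&st, pfn, end, node_end_pfn))
				break;	/* rest of this region is deferred */
			count++;
		}
	}
	return count;
}
```

With a 4-section region followed by a 2-section region in the same zone,
the model initialises five sections eagerly when the region end is passed
and a single section when the zone end is passed.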

Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Baoquan He 
Reported-by: Rahul Gopakumar 
Reviewed-by: Mike Rapoport 
Cc: David Hildenbrand 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/rmap: always do TTU_IGNORE_ACCESS

2020-12-30T10:53:55+00:00

[ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check.  This specifically affects
those secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start(), like kvm.

However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e. the
absence of TTU_IGNORE_ACCESS, and it explicitly performs the page table
access check before trying to unmap the page.  So, at worst, reclaim will
miss accesses in a very short window if we remove the page table access
check from the unmapping code.

There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
reclaim.  In memcg reclaim, page_referenced() only accounts the accesses
from processes which are in the same memcg as the target page, but the
unmapping code considers accesses from all processes, which decreases the
effectiveness of memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt 
Acked-by: Johannes Weiner 
Cc: Hugh Dickins 
Cc: Jerome Glisse 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Cc: Andrea Arcangeli 
Cc: Dan Williams 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports

2020-11-22T18:48:22+00:00

The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid().  That
symbol is exported for modules.  However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:

	CONFIG_NUMA_KEEP_MEMINFO=y
	CONFIG_MEMORY_HOTPLUG=y

...and:

	CONFIG_NUMA_KEEP_MEMINFO=n
	CONFIG_MEMORY_HOTPLUG=y

...it failed to export the symbol in the case of:

	CONFIG_NUMA_KEEP_MEMINFO=y
	CONFIG_MEMORY_HOTPLUG=n

Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that the
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.

Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols.  Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.
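
The override pattern can be sketched in a single translation unit.  This is
an illustration only, assuming a made-up arch helper
(demo_arch_phys_to_target_node) and a made-up node mapping; only the two
function names come from the text above, and uint64_t stands in for the
kernel's u64.

```c
#include <assert.h>
#include <stdint.h>

/* --- hypothetical stand-in for an architecture's asm/sparsemem.h --- */
static inline int demo_arch_phys_to_target_node(uint64_t start)
{
	return (int)(start >> 40);	/* made-up mapping: one node per TiB */
}
/* Defining the symbol as a macro tells generic code an override exists. */
#define phys_to_target_node demo_arch_phys_to_target_node

/* --- generic header: defaults apply only where the arch defined nothing --- */
#ifndef memory_add_physaddr_to_nid
static inline int memory_add_physaddr_to_nid(uint64_t start)
{
	(void)start;
	return 0;	/* unknown node, assume node 0 */
}
#endif

#ifndef phys_to_target_node
static inline int phys_to_target_node(uint64_t start)
{
	(void)start;
	return 0;
}
#endif

/* Example caller: resolve both helpers for one address. */
static void demo_lookup(uint64_t start, int *online_nid, int *target_nid)
{
	assert(online_nid && target_nid);
	*online_nid = memory_add_physaddr_to_nid(start);
	*target_nid = phys_to_target_node(start);
}
```

Because every variant is a static inline or a macro resolved at compile
time, no __weak symbol is involved and nothing depends on which object
file happens to export it.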

The only common header that all memory_add_physaddr_to_nid() producing
architectures implement is asm/sparsemem.h.  In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h.  An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.

The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.

[dan.j.williams@intel.com: v4]
  Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
  Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com

Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
Reported-by: Randy Dunlap 
Reported-by: Thomas Gleixner 
Reported-by: kernel test robot 
Reported-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
Signed-off-by: Andrew Morton 
Tested-by: Randy Dunlap 
Tested-by: Thomas Gleixner 
Reviewed-by: Thomas Gleixner 
Reviewed-by: Christoph Hellwig 
Cc: Joao Martins 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Vishal Verma 
Cc: Stephen Rothwell 
Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds

mm/memory_hotplug: remove a wrapper for alloc_migration_target()

2020-10-18T16:27:09+00:00

To calculate the correct node to migrate a page to for hotplug, we need to
check the node id of the page.  A wrapper for alloc_migration_target()
exists for this purpose.

However, Vlastimil points out that all migration source pages come from a
single node.  In this case, we don't need to check the node id for each
page and we don't need to re-set the target nodemask for each page by
using the wrapper.  Set up the migration_target_control once and use it
for all pages.
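
A minimal sketch of "set up once, use for all pages" follows.  The struct
and helper are hypothetical stand-ins (a plain bitmask replaces the
kernel's nodemask), and the policy of preferring nodes other than the
source node is an assumption made for this illustration.

```c
#include <assert.h>

#define DEMO_MAX_NODES 8	/* assumption for this sketch */

/* Hypothetical stand-in for the kernel's migration_target_control. */
struct demo_migration_target {
	int nid;		/* node the source pages live on */
	unsigned int nmask;	/* bit n set: node n is an allowed target */
};

/* Build the control block once, from the node of the first source page. */
static struct demo_migration_target
demo_setup_target(int first_page_nid, unsigned int nodes_with_memory)
{
	struct demo_migration_target mtc;

	assert(first_page_nid >= 0 && first_page_nid < DEMO_MAX_NODES);
	mtc.nid = first_page_nid;
	/* Prefer other nodes; fall back to the source node if it is the only one. */
	mtc.nmask = nodes_with_memory & ~(1u << first_page_nid);
	if (mtc.nmask == 0)
		mtc.nmask = 1u << first_page_nid;
	return mtc;
}
```

The resulting value would then be handed unchanged to the allocation
callback for every page on the source list, instead of being recomputed
per page.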

Signed-off-by: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
Cc: Christoph Hellwig 
Cc: Mike Kravetz 
Cc: Naoya Horiguchi 
Cc: Roman Gushchin 
Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds

mm/memory_hotplug: update comment regarding zone shuffling

2020-10-16T18:11:18+00:00

As we no longer shuffle via generic_online_page() and when undoing
isolation, we can simplify the comment.

We now effectively shuffle only once (properly) when onlining new memory.

Signed-off-by: David Hildenbrand 
Signed-off-by: Andrew Morton 
Reviewed-by: Wei Yang 
Acked-by: Michal Hocko 
Cc: Alexander Duyck 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Dave Hansen 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Oscar Salvador 
Cc: Mike Rapoport 
Cc: Pankaj Gupta 
Cc: Haiyang Zhang 
Cc: "K. Y. Srinivasan" 
Cc: Matthew Wilcox 
Cc: Michael Ellerman 
Cc: Scott Cheloha 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Link: https://lkml.kernel.org/r/20201005121534.15649-6-david@redhat.com
Signed-off-by: Linus Torvalds

mm: don't panic when links can't be created in sysfs

2020-10-16T18:11:18+00:00

At boot time, or when doing memory hot-add operations, the system is still
able to run if the links in sysfs can't be created.  So just report the
error in the kernel log rather than calling BUG_ON(), which can make the
system unusable because the call path can be reached with locks held.

Since the number of memory blocks managed could be high, the messages are
rate limited.

As a consequence, link_mem_sections() has no status to report anymore.
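
The "warn, rate limit, and carry on" behaviour can be sketched in user
space.  This is a model only: the demo_* names, the tick-based window and
the message text are assumptions, and the kernel uses its own ratelimit
helpers rather than this code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct demo_ratelimit {
	long interval;		/* window length, in caller-defined ticks */
	int burst;		/* messages allowed per window */
	long window_start;
	int printed;
	int suppressed;
};

/* Return true if a message may be emitted at time "now". */
static bool demo_ratelimit_ok(struct demo_ratelimit *rl, long now)
{
	assert(rl->interval > 0 && rl->burst > 0);
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->suppressed++;
	return false;
}

/* Report a failed link creation and continue; there is no status to return. */
static void demo_report_link_error(struct demo_ratelimit *rl, long now,
				   unsigned long block_id, int err)
{
	if (err && demo_ratelimit_ok(rl, now))
		fprintf(stderr, "failed to link memory block %lu: %d\n",
			block_id, err);
}
```

With a burst of 2 per window, five failures in the same window produce two
log lines and three suppressed ones, and the caller keeps going either way.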

Signed-off-by: Laurent Dufour 
Signed-off-by: Andrew Morton 
Reviewed-by: Oscar Salvador 
Acked-by: Michal Hocko 
Acked-by: David Hildenbrand 
Cc: Greg Kroah-Hartman 
Cc: Fenghua Yu 
Cc: Nathan Lynch 
Cc: "Rafael J . Wysocki" 
Cc: Scott Cheloha 
Cc: Tony Luck 
Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds

kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()

2020-10-16T18:11:18+00:00

"mem" in the name already indicates the root, similar to
release_mem_region() and devm_request_mem_region().  Make it implicit.
The single caller always passes iomem_resource; other parents are not
applicable.

Suggested-by: Wei Yang 
Signed-off-by: David Hildenbrand 
Signed-off-by: Andrew Morton 
Reviewed-by: Wei Yang 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Cc: Kees Cook 
Cc: Ard Biesheuvel 
Cc: Pankaj Gupta 
Cc: Baoquan He 
Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com
Signed-off-by: Linus Torvalds

mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources

2020-10-16T18:11:18+00:00

Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, hyper-v balloon, and the XEN balloon.

This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space).  We really want to merge
added resources in this scenario where possible.

Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
either created within add_memory*() or passed via add_memory_resource()
shall be marked mergeable and merged with applicable siblings.

To implement that, we need a kernel/resource interface to mark selected
System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
merging.

Note: We really want to merge after the whole operation succeeded, not
directly when adding a resource to the resource tree (it would break
add_memory_resource() and require splitting resources again when the
operation failed - e.g., due to -ENOMEM).
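
The merge step can be illustrated with a flat, sorted array instead of the
kernel's resource tree.  This is a sketch under stated assumptions: the
demo_res type and demo_merge_adjacent() are hypothetical, the whole array
is merged in one pass rather than one resource with its siblings, and the
call is meant to run only after the add operation has succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_res {
	uint64_t start, end;	/* inclusive bounds */
	bool mergeable;		/* models IORESOURCE_SYSRAM_MERGEABLE */
};

/*
 * Merge neighbouring entries that are both mergeable and contiguous.
 * The array must be sorted and non-overlapping; returns the new length.
 */
static size_t demo_merge_adjacent(struct demo_res *res, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	for (size_t i = 1; i < n; i++) {
		assert(res[i].start > res[out].end);
		if (res[out].mergeable && res[i].mergeable &&
		    res[out].end + 1 == res[i].start)
			res[out].end = res[i].end;	/* extend previous entry */
		else
			res[++out] = res[i];		/* keep as separate entry */
	}
	return out + 1;
}
```

Three contiguous 128 MiB blocks flagged mergeable collapse into one entry,
while an adjacent unflagged range (a DIMM, say) keeps its own boundaries.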

Signed-off-by: David Hildenbrand 
Signed-off-by: Andrew Morton 
Reviewed-by: Pankaj Gupta 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Jason Gunthorpe 
Cc: Kees Cook 
Cc: Ard Biesheuvel 
Cc: Thomas Gleixner 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Wei Liu 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Roger Pau Monné 
Cc: Julien Grall 
Cc: Baoquan He 
Cc: Wei Yang 
Cc: Anton Blanchard 
Cc: Benjamin Herrenschmidt 
Cc: Christian Borntraeger 
Cc: Dave Jiang 
Cc: Eric Biederman 
Cc: Greg Kroah-Hartman 
Cc: Heiko Carstens 
Cc: Jason Wang 
Cc: Len Brown 
Cc: Leonardo Bras 
Cc: Libor Pechacek 
Cc: Michael Ellerman 
Cc: "Michael S. Tsirkin" 
Cc: Nathan Lynch 
Cc: "Oliver O'Halloran" 
Cc: Paul Mackerras 
Cc: Pingfan Liu 
Cc: "Rafael J. Wysocki" 
Cc: Vasily Gorbik 
Cc: Vishal Verma 
Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com
Signed-off-by: Linus Torvalds