linux.git/mm/memory_hotplug.c, branch v6.11

x86/kaslr: Expose and use the end of the physical memory address space

2024-08-20T11:44:57+00:00

iounmap() on x86 occasionally fails to unmap because the provided valid
ioremap address is not below high_memory. It turned out that this
happens due to KASLR.

KASLR uses the full address space between PAGE_OFFSET and vaddr_end to
randomize the starting points of the direct map, vmalloc and vmemmap
regions.  It thereby limits the size of the direct map by using the
installed memory size plus an extra configurable margin for hot-plug
memory.  This limitation is done to gain more randomization space
because otherwise only the holes between the direct map, vmalloc,
vmemmap and vaddr_end would be usable for randomizing.

The limited direct map size is not exposed to the rest of the kernel, so
the memory hot-plug and resource management related code paths still
operate under the assumption that the available address space can be
determined with MAX_PHYSMEM_BITS.

request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
downwards.  That means the first allocation happens past the end of the
direct map and if unlucky this address is in the vmalloc space, which
causes high_memory to become greater than VMALLOC_START and consequently
causes iounmap() to fail for valid ioremap addresses.
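
The arithmetic behind this failure can be sketched in userspace C. This
is a simplified model, not the kernel code: the 46-bit value, the direct
map size used in the usage note and all model_* names are assumptions
made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative value only; the real constant depends on the paging mode. */
#define MODEL_MAX_PHYSMEM_BITS 46

/* Highest size-aligned base so that [base, base + size) fits at or below limit. */
static uint64_t model_top_down_base(uint64_t limit, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0); /* power of two */
	assert(limit + 1 >= size);
	return (limit + 1 - size) & ~(size - 1);
}

/* True if [base, base + size) ends past the last address the direct map covers. */
static bool model_past_direct_map(uint64_t base, uint64_t size,
				  uint64_t direct_map_end)
{
	return base + size - 1 > direct_map_end;
}
```

With a direct map assumed to end at 2 TiB - 1, a 1 GiB region allocated
downwards from (1 << MODEL_MAX_PHYSMEM_BITS) - 1 lies past the direct
map, while the same allocation started from the direct map end stays
inside it.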

MAX_PHYSMEM_BITS cannot be changed to address this because the
randomization does not align with address bit boundaries and there are
other places which actually need to know the maximum number of address
bits.  All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed
and found to be correct.

Cure this by exposing the end of the direct map via PHYSMEM_END and use
that for the memory hot-plug and resource management related places
instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END
maps to a variable which is initialized by the KASLR initialization and
otherwise it is based on MAX_PHYSMEM_BITS as before.

To prevent future hiccups add a check into add_pages() to catch callers
trying to add memory above PHYSMEM_END.
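
A check of this kind can be sketched as follows. This is a userspace
model under stated assumptions, not the code added by the patch: the
4 KiB page size, the physmem_end parameter and the model_* names are
hypothetical stand-ins.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* 4 KiB pages, assumed */

/*
 * Return 0 if the last byte of [start_pfn, start_pfn + nr_pages) is at or
 * below physmem_end, -ERANGE otherwise.
 */
static int model_check_add_pages(uint64_t start_pfn, uint64_t nr_pages,
				 uint64_t physmem_end)
{
	uint64_t end = ((start_pfn + nr_pages) << MODEL_PAGE_SHIFT) - 1;

	if (end > physmem_end)
		return -ERANGE;
	return 0;
}
```

With physmem_end set to 1 TiB - 1, adding exactly 1 TiB worth of pages
starting at PFN 0 passes, and adding one more page is rejected.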

Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions")
Reported-by: Max Ramanouski 
Reported-by: Alistair Popple 
Signed-off-by: Thomas Gleixner 
Tested-by: Max Ramanouski 
Tested-by: Alistair Popple 
Reviewed-by: Dan Williams 
Reviewed-by: Alistair Popple 
Reviewed-by: Kees Cook 
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx

mm/page_alloc: put __free_pages_core() in __meminit section

2024-07-12T22:52:21+00:00

__free_pages_core() is only used in the bootmem init and hot-add memory
init paths.  Let's put it in the __meminit section.

Link: https://lkml.kernel.org/r/20240706061615.30322-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang 
Acked-by: David Hildenbrand 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton

mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when offlining

2024-07-04T02:30:18+00:00

We currently have a hack for virtio-mem in place to handle memory
offlining with PageOffline pages for which we already adjusted the managed
page count.

Let's enlighten memory offlining code so we can get rid of that hack, and
document the situation.

Link: https://lkml.kernel.org/r/20240607090939.89524-4-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Oscar Salvador 
Cc: Alexander Potapenko 
Cc: Dexuan Cui 
Cc: Dmitry Vyukov 
Cc: Eugenio Pérez 
Cc: Haiyang Zhang 
Cc: Jason Wang 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Marco Elver 
Cc: Michael S. Tsirkin 
Cc: Mike Rapoport (IBM) 
Cc: Oleksandr Tyshchenko 
Cc: Stefano Stabellini 
Cc: Wei Liu 
Cc: Xuan Zhuo 
Signed-off-by: Andrew Morton

mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

2024-07-04T02:30:18+00:00

We currently initialize the memmap such that PG_reserved is set and the
refcount of the page is 1.  In virtio-mem code, we have to manually clear
that PG_reserved flag to make memory offlining with partially hotplugged
memory blocks possible: has_unmovable_pages() would otherwise bail out on
such pages.

We want to avoid PG_reserved where possible and move to typed pages
instead.  We also want to further enlighten memory offlining code
about PG_offline: offline pages in an online memory section.  One example
is handling managed page count adjustments in a cleaner way during memory
offlining.

So let's initialize the pages with PG_offline instead of PG_reserved. 
generic_online_page()->__free_pages_core() will now clear that flag before
handing that memory to the buddy.

Note that the page refcount is still 1 and would forbid offlining of such
memory except when special care is taken during GOING_OFFLINE as currently
only implemented by virtio-mem.

With this change, we can now get non-PageReserved() pages in the XEN
balloon list.  From what I can tell, that can already happen via
decrease_reservation(), so that should be fine.

HV-balloon should not really observe a change: partial online memory
blocks still cannot get surprise-offlined, because the refcount of these
PageOffline() pages is 1.

Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
hotplugged pages are now PageOffline() instead of PageReserved() before
they are handed over to the buddy.

We'll leave the ZONE_DEVICE case alone for now.

Note that self-hosted vmemmap pages will no longer be marked as
reserved.  This matches ordinary vmemmap pages allocated from the buddy
during memory hotplug.  Now, really only vmemmap pages allocated from
memblock during early boot will be marked reserved.  Existing
PageReserved() checks seem to be handling all relevant cases correctly
even after this change.

Link: https://lkml.kernel.org/r/20240607090939.89524-3-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Oscar Salvador  [generic memory-hotplug bits]
Cc: Alexander Potapenko 
Cc: Dexuan Cui 
Cc: Dmitry Vyukov 
Cc: Eugenio Pérez 
Cc: Haiyang Zhang 
Cc: Jason Wang 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Marco Elver 
Cc: Michael S. Tsirkin 
Cc: Mike Rapoport (IBM) 
Cc: Oleksandr Tyshchenko 
Cc: Stefano Stabellini 
Cc: Wei Liu 
Cc: Xuan Zhuo 
Signed-off-by: Andrew Morton

mm: pass meminit_context to __free_pages_core()

2024-07-04T02:30:18+00:00

Patch series "mm/memory_hotplug: use PageOffline() instead of
PageReserved() for !ZONE_DEVICE".

This can be considered a long-overdue follow-up to some parts of [1].
The patches are based on [2], but [2] is not strictly required -- it just
makes it clearer why we can use adjust_managed_page_count() for memory
hotplug without going into details about highmem.

We stop initializing pages with PageReserved() in memory hotplug code --
except when dealing with ZONE_DEVICE for now.  Instead, we use
PageOffline(): all pages are initialized to PageOffline() when onlining a
memory section, and only the ones actually getting exposed to the
system/page allocator will get PageOffline cleared.
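
The page state transitions described above can be modeled in a few
lines of userspace C. This is only a sketch: struct model_page and the
model_* helpers are hypothetical and merely mirror the description
(every page starts offline with refcount 1, and only pages handed to the
allocator lose the offline marker).

```c
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	bool offline;  /* stands in for the PageOffline() page type */
	int refcount;
};

/* Onlining a section: every page starts out offline with refcount 1. */
static void model_init_section(struct model_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		pages[i].offline = true;
		pages[i].refcount = 1;
	}
}

/* Exposing one page to the allocator clears the offline marker. */
static void model_expose_page(struct model_page *page)
{
	page->offline = false;
	page->refcount = 0;
}

/* Count the pages that are still offline in an online section. */
static size_t model_count_offline(const struct model_page *pages, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (pages[i].offline)
			count++;
	return count;
}
```

Pages that a driver never exposes stay offline with refcount 1, which is
what lets offlining code tell them apart from pages in the allocator.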

This way, we enlighten memory hotplug more about PageOffline() pages and
can clean up some hacks we have in virtio-mem code.

What about ZONE_DEVICE?  PageOffline() is wrong, but we might just stop
using PageReserved() for them later by simply checking for
is_zone_device_page() at suitable places.  That will be a separate patch
set / proposal.

This primarily affects virtio-mem, HV-balloon and XEN balloon. I only
briefly tested with virtio-mem, which benefits most from these cleanups.

[1] https://lore.kernel.org/all/20191024120938.11237-1-david@redhat.com/
[2] https://lkml.kernel.org/r/20240607083711.62833-1-david@redhat.com


This patch (of 3):

In preparation for further changes, let's teach __free_pages_core() about
the differences of memory hotplug handling.

Move the memory hotplug specific handling from generic_online_page() to
__free_pages_core(), use adjust_managed_page_count() on the memory hotplug
path, and spell out why memory freed via memblock cannot currently use
adjust_managed_page_count().
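
The context-dependent accounting can be sketched as a small userspace
model. All model_* names and the two-counter struct are assumptions made
for illustration; the sketch also assumes that the early-boot caller
fixes up the system-wide total itself, which is why only the hotplug
path goes through the stand-in for adjust_managed_page_count().

```c
enum model_meminit_context {
	MODEL_MEMINIT_EARLY,
	MODEL_MEMINIT_HOTPLUG,
};

struct model_counters {
	long zone_managed; /* per-zone managed page count */
	long total_ram;    /* system-wide page count */
};

/* Stand-in for adjust_managed_page_count(): updates both counters. */
static void model_adjust_managed(struct model_counters *c, long nr_pages)
{
	c->zone_managed += nr_pages;
	c->total_ram += nr_pages;
}

static void model_free_pages_core(struct model_counters *c, long nr_pages,
				  enum model_meminit_context context)
{
	if (context == MODEL_MEMINIT_HOTPLUG) {
		model_adjust_managed(c, nr_pages);
	} else {
		/* Assumed: the early-boot caller updates total_ram itself. */
		c->zone_managed += nr_pages;
	}
}
```

Passing the context as a parameter keeps the decision in one place
instead of spreading hotplug-specific handling across the callers.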

[david@redhat.com: add missed CONFIG_DEFERRED_STRUCT_PAGE_INIT]
  Link: https://lkml.kernel.org/r/b72e6efd-fb0a-459c-b1a0-88a98e5b19e2@redhat.com
[david@redhat.com: fix up the memblock comment, per Oscar]
  Link: https://lkml.kernel.org/r/2ed64218-7f3b-4302-a5dc-27f060654fe2@redhat.com
[david@redhat.com: add the parameter name also in the declaration]
  Link: https://lkml.kernel.org/r/ca575956-f0dd-4fb9-a307-6b7621681ed9@redhat.com
Link: https://lkml.kernel.org/r/20240607090939.89524-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240607090939.89524-2-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Alexander Potapenko 
Cc: Dexuan Cui 
Cc: Dmitry Vyukov 
Cc: Eugenio Pérez 
Cc: Haiyang Zhang 
Cc: Jason Wang 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Marco Elver 
Cc: Michael S. Tsirkin 
Cc: Mike Rapoport (IBM) 
Cc: Oleksandr Tyshchenko 
Cc: Oscar Salvador 
Cc: Stefano Stabellini 
Cc: Wei Liu 
Cc: Xuan Zhuo 
Signed-off-by: Andrew Morton

mm/memory_hotplug: prevent accessing by index=-1

2024-07-04T02:30:09+00:00

nid may be equal to NUMA_NO_NODE=-1.  Prevent accessing node_data array by
invalid index with check for nid.
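
The guard can be illustrated with a minimal userspace sketch. The array,
its size and the model_* names are hypothetical; only NUMA_NO_NODE being
-1 is taken from the description above.

```c
#include <stddef.h>

#define NUMA_NO_NODE (-1)
#define MODEL_MAX_NODES 4 /* hypothetical node count */

struct model_pgdat {
	int node_id;
};

static struct model_pgdat model_node_data[MODEL_MAX_NODES] = {
	{ 0 }, { 1 }, { 2 }, { 3 },
};

/* Return the per-node data, or NULL when nid does not name a valid node. */
static struct model_pgdat *model_node_lookup(int nid)
{
	if (nid == NUMA_NO_NODE)
		return NULL;
	if (nid < 0 || nid >= MODEL_MAX_NODES)
		return NULL;
	return &model_node_data[nid];
}
```

Without the first check, a nid of -1 would index one element before the
start of the array.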

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lkml.kernel.org/r/20240606080659.18525-1-abelova@astralinux.ru
Fixes: e83a437faa62 ("mm/memory_hotplug: introduce "auto-movable" online policy")
Signed-off-by: Anastasia Belova 
Acked-by: David Hildenbrand 
Acked-by: Oscar Salvador 
Signed-off-by: Andrew Morton

mm/memory_hotplug: drop memblock_phys_free() call in try_remove_memory()

2024-07-04T02:30:04+00:00

The call to memblock_phys_free() in try_remove_memory() does not balance
any call to memblock_alloc() (or memblock_reserve() for that matter).

There are no memblock_reserve() calls in mm/memory_hotplug.c, no memblock
allocations possible after mm_core_init(), and even if memblock_add_node()
called from add_memory_resource() would need to allocate memory, that
memory would be allocated from slab.

The patch f9126ab9241f ("memory-hotplug: fix wrong edge when hot add a new
node") that introduced that call to memblock_free() does not adequately
describe why it was required, and tinkering with memblock in the context
of memory hotplug on x86 seems bogus because x86 never kept memblock
after boot anyway.

Drop memblock_phys_free() call in try_remove_memory().

[rppt@kernel.org: rewrite the commit message]
Link: https://lkml.kernel.org/r/20240605082049.973242-1-rppt@kernel.org
Signed-off-by: Jonathan Cameron 
Signed-off-by: Mike Rapoport (IBM) 
Acked-by: David Hildenbrand 
Acked-by: Oscar Salvador 
Signed-off-by: Andrew Morton

mm/hugetlb: mm/memory_hotplug: use a folio in scan_movable_pages()

2024-07-04T02:30:02+00:00

By using a folio in scan_movable_pages() we convert the last user of the
page-based hugetlb information macro functions to the folio version. 
After this conversion, we can safely remove the page-based definitions
from include/linux/hugetlb.h.

Link: https://lkml.kernel.org/r/20240530171427.242018-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar 
Acked-by: David Hildenbrand 
Cc: Matthew Wilcox (Oracle) 
Cc: Muchun Song 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton

mm/hugetlb: rename dissolve_free_huge_pages() to dissolve_free_hugetlb_folios()

2024-05-06T00:53:35+00:00

dissolve_free_huge_pages() only uses folios internally, rename it to
dissolve_free_hugetlb_folios() and change the comments which reference it.

[akpm@linux-foundation.org: remove unneeded `extern']
Link: https://lkml.kernel.org/r/20240412182139.120871-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar 
Reviewed-by: Vishal Moola (Oracle) 
Reviewed-by: Miaohe Lin 
Cc: Jane Chu 
Cc: Matthew Wilcox (Oracle) 
Cc: Muchun Song 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton

mm: record the migration reason for struct migration_target_control

2024-04-26T03:56:06+00:00

Patch series "make the hugetlb migration strategy consistent", v2.

As discussed in previous thread [1], there is an inconsistency when
handling hugetlb migration.  When handling the migration of freed hugetlb,
it prevents fallback to other NUMA nodes in
alloc_and_dissolve_hugetlb_folio().  However, when dealing with in-use
hugetlb, it allows fallback to other NUMA nodes in
alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool
and might result in unexpected failures when node-bound workloads don't
get what is assumed to be available.

This patchset tries to make the hugetlb migration strategy more clear
and consistent. Please find details in each patch.

[1]
https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/


This patch (of 2):

To support different hugetlb allocation strategies during hugetlb
migration based on various migration reasons, record the migration reason
in the migration_target_control structure as a preparation.
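
How a recorded reason could later drive the allocation strategy is
sketched below. This is a hypothetical model: the struct is a reduced
stand-in for migration_target_control, the enum lists only a few
made-up reasons, and the fallback policy is invented for the example and
does not reflect which reasons the kernel treats this way.

```c
#include <stdbool.h>

/* Hypothetical subset of migration reasons. */
enum model_migrate_reason {
	MODEL_MR_COMPACTION,
	MODEL_MR_MEMORY_HOTPLUG,
	MODEL_MR_SYSCALL,
};

/* Reduced stand-in for struct migration_target_control. */
struct model_mtc {
	int nid;                          /* preferred target node */
	enum model_migrate_reason reason; /* recorded by the caller */
};

/* Invented policy: only hotplug-driven migration may leave the preferred node. */
static bool model_allow_node_fallback(const struct model_mtc *mtc)
{
	switch (mtc->reason) {
	case MODEL_MR_MEMORY_HOTPLUG:
		return true;
	default:
		return false;
	}
}
```

The point is only that the allocation callback can consult the reason
the caller recorded rather than applying one rule to every migration.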

Link: https://lkml.kernel.org/r/cover.1709719720.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/7b95d4981e07211f57139fc5b1f7ce91b920cee4.1709719720.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang 
Reviewed-by: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Miaohe Lin 
Cc: Michal Hocko 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Signed-off-by: Andrew Morton