linux.git/mm/memblock.c, branch v6.6

mm: disable kernelcore=mirror when no mirror memory

2023-08-21T20:37:43+00:00

For system with kernelcore=mirror enabled while no mirrored memory is
reported by efi.  This could lead to kernel OOM during startup since all
memory beside zone DMA are in the movable zone and this prevents the
kernel to use it.

Zone DMA/DMA32 initialization is independent of mirrored memory and their
max pfn is set in zone_sizes_init().  Since kernel can fallback to zone
DMA/DMA32 if there is no memory in zone Normal, these zones are seen as
mirrored memory no mather their memory attributes are.

To solve this problem, disable kernelcore=mirror when there is no real
mirrored memory exists.

Link: https://lkml.kernel.org/r/20230802072328.2107981-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng 
Suggested-by: Kefeng Wang 
Suggested-by: Mike Rapoport 
Reviewed-by: Mike Rapoport (IBM) 
Reviewed-by: Kefeng Wang 
Cc: Levi Yun 
Signed-off-by: Andrew Morton

Revert "mm,memblock: reset memblock.reserved to system init state to prevent UAF"

2023-07-28T16:47:06+00:00

This reverts commit 9e46e4dcd9d6cd88342b028dbfa5f4fb7483d39c.

kbuild reports a warning in memblock_remove_region() because of a false
positive caused by partial reset of the memblock state.

Doing the full reset will remove the false positives, but will allow
late use of memblock_free() to go unnoticed, so it is better to revert
the offending commit.

   WARNING: CPU: 0 PID: 1 at mm/memblock.c:352 memblock_remove_region (kbuild/src/x86_64/mm/memblock.c:352 (discriminator 1))
   Modules linked in:
   CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.5.0-rc3-00001-g9e46e4dcd9d6 #2
   RIP: 0010:memblock_remove_region (kbuild/src/x86_64/mm/memblock.c:352 (discriminator 1))
   Call Trace:
     memblock_discard (kbuild/src/x86_64/mm/memblock.c:383)
     page_alloc_init_late (kbuild/src/x86_64/include/linux/find.h:208 kbuild/src/x86_64/include/linux/nodemask.h:266 kbuild/src/x86_64/mm/mm_init.c:2405)
     kernel_init_freeable (kbuild/src/x86_64/init/main.c:1325 kbuild/src/x86_64/init/main.c:1546)
     kernel_init (kbuild/src/x86_64/init/main.c:1439)
     ret_from_fork (kbuild/src/x86_64/arch/x86/kernel/process.c:145)
     ret_from_fork_asm (kbuild/src/x86_64/arch/x86/entry/entry_64.S:298)

Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-lkp/202307271656.447aa17e-oliver.sang@intel.com
Signed-off-by: "Mike Rapoport (IBM)" 
Signed-off-by: Linus Torvalds

mm,memblock: reset memblock.reserved to system init state to prevent UAF

2023-07-24T05:52:56+00:00

The memblock_discard function frees the memblock.reserved.regions
array, which is good.

However, if a subsequent memblock_free (or memblock_phys_free) comes
in later, from for example ima_free_kexec_buffer, that will result in
a use after free bug in memblock_isolate_range.

When running a kernel with CONFIG_KASAN enabled, this will cause a
kernel panic very early in boot. Without CONFIG_KASAN, there is
a chance that memblock_isolate_range might scribble on memory
that is now in use by somebody else.

Avoid those issues by making sure that memblock_discard points
memblock.reserved.regions back at the static buffer.

If memblock_free is called after memblock memory is discarded, that will
print a warning in memblock_remove_region.

Signed-off-by: Rik van Riel 
Link: https://lore.kernel.org/r/20230719154137.732d8525@imladris.surriel.com
Signed-off-by: Mike Rapoport (IBM)

Merge tag 'memblock-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

2023-06-30T06:21:20+00:00

Pull memblock updates from Mike Rapoport:

 - add test for memblock_alloc_node()

 - minor coding style fixes

 - add flags and nid info in memblock debugfs

* tag 'memblock-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: Update nid info in memblock debugfs
  memblock: Add flags and nid info in memblock debugfs
  Fix some coding style errors in memblock.c
  Add tests for memblock_alloc_node()

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2023-06-28T17:28:11+00:00

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...

mm: pass nid to reserve_bootmem_region()

2023-06-23T23:59:27+00:00

early_pfn_to_nid() is called frequently in init_reserved_page(), it
returns the node id of the PFN.  These PFN are probably from the same
memory region, they have the same node id.  It's not necessary to call
early_pfn_to_nid() for each PFN.

Pass nid to reserve_bootmem_region() and drop the call to
early_pfn_to_nid() in init_reserved_page().  Also, set nid on all reserved
pages before doing this, as some reserved memory regions may not be set
nid.

The most beneficial function is memmap_init_reserved_pages() if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.

The following data was tested on an x86 machine with 190GB of RAM.

before:
memmap_init_reserved_pages()  67ms

after:
memmap_init_reserved_pages()  20ms

Link: https://lkml.kernel.org/r/20230619023406.424298-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng 
Reviewed-by: Mike Rapoport (IBM) 
Signed-off-by: Andrew Morton

mm/memory_hotplug: remove reset_node_managed_pages() in hotadd_init_pgdat()

2023-06-19T23:19:05+00:00

managed pages has already been set to 0 in free_area_init_core_hotplug(),
via zone_init_internals() on each zone.  It's pointless to reset again.

Furthermore, reset_node_managed_pages() no longer needs to be exposed
outside of mm/memblock.c.  Remove declaration in include/linux/memblock.h
and define it as static.

In addtion to this, the only caller of reset_node_managed_pages() is
reset_all_zones_managed_pages(), which is annotated with __init, so it
should be safe to also mark reset_node_managed_pages() as __init.

Link: https://lkml.kernel.org/r/20230607024548.1240-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu 
Suggested-by: David Hildenbrand 
Cc: Michal Hocko 
Cc: Mike Rapoport (IBM) 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton

mm: Add support for unaccepted memory

2023-06-06T14:38:22+00:00

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have runtime cost once the system is booted. The downside is
    very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond the point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time kernel steps
    onto a new memory block. The spikes will go away once workload data
    set size gets stabilized or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize time the
    system experience latency spikes on memory allocation while keeping
    low boot time.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock accepts memory on allocation. It serves early boot memory
    allocations and doesn't limit them to pre-accepted pool of memory.

  - page allocator accepts memory on the first allocation of the page.
    When kernel runs out of accepted memory, it accepts memory until the
    high watermark is reached. It helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks anything within the range
   of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Vlastimil Babka 
Acked-by: Mike Rapoport 	# memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com

memblock: Update nid info in memblock debugfs

2023-06-02T05:23:41+00:00

The node id for memblock reserved regions will be wrong,
so let's show 'x' for reg->nid == MAX_NUMNODES in debugfs to keep it align.

Suggested-by: Mike Rapoport (IBM) 
Co-developed-by: Kefeng Wang 
Signed-off-by: Kefeng Wang 
Signed-off-by: Yuwei Guan 
Link: https://lore.kernel.org/r/20230601133149.37160-1-ssawgyw@gmail.com
Signed-off-by: Mike Rapoport (IBM)

memblock: Add flags and nid info in memblock debugfs

2023-05-24T08:56:30+00:00

Currently, the memblock debugfs can display the count of memblock_type and
the base and end of the reg. However, when memblock_mark_*() or
memblock_set_node() is executed on some range, the information in the
existing debugfs cannot make it clear why the address is not consecutive.

For example,
cat /sys/kernel/debug/memblock/memory
   0: 0x0000000080000000..0x00000000901fffff
   1: 0x0000000090200000..0x00000000905fffff
   2: 0x0000000090600000..0x0000000092ffffff
   3: 0x0000000093000000..0x00000000973fffff
   4: 0x0000000097400000..0x00000000b71fffff
   5: 0x00000000c0000000..0x00000000dfffffff
   6: 0x00000000e2500000..0x00000000f87fffff
   7: 0x00000000f8800000..0x00000000fa7fffff
   8: 0x00000000fa800000..0x00000000fd3effff
   9: 0x00000000fd3f0000..0x00000000fd3fefff
  10: 0x00000000fd3ff000..0x00000000fd7fffff
  11: 0x00000000fd800000..0x00000000fd901fff
  12: 0x00000000fd902000..0x00000000fd909fff
  13: 0x00000000fd90a000..0x00000000fd90bfff
  14: 0x00000000fd90c000..0x00000000ffffffff
  15: 0x0000000880000000..0x0000000affffffff

So we can add flags and nid to this debugfs.

For example,
cat /sys/kernel/debug/memblock/memory
   0: 0x0000000080000000..0x00000000901fffff    0 NONE
   1: 0x0000000090200000..0x00000000905fffff    0 NOMAP
   2: 0x0000000090600000..0x0000000092ffffff    0 NONE
   3: 0x0000000093000000..0x00000000973fffff    0 NOMAP
   4: 0x0000000097400000..0x00000000b71fffff    0 NONE
   5: 0x00000000c0000000..0x00000000dfffffff    0 NONE
   6: 0x00000000e2500000..0x00000000f87fffff    0 NONE
   7: 0x00000000f8800000..0x00000000fa7fffff    0 NOMAP
   8: 0x00000000fa800000..0x00000000fd3effff    0 NONE
   9: 0x00000000fd3f0000..0x00000000fd3fefff    0 NOMAP
  10: 0x00000000fd3ff000..0x00000000fd7fffff    0 NONE
  11: 0x00000000fd800000..0x00000000fd901fff    0 NOMAP
  12: 0x00000000fd902000..0x00000000fd909fff    0 NONE
  13: 0x00000000fd90a000..0x00000000fd90bfff    0 NOMAP
  14: 0x00000000fd90c000..0x00000000ffffffff    0 NONE
  15: 0x0000000880000000..0x0000000affffffff    0 NONE

Signed-off-by: Yuwei Guan 
Reviewed-by: Anshuman Khandual 
Reviewed-by: Kefeng Wang 
Link: https://lore.kernel.org/r/20230519105321.333-1-ssawgyw@gmail.com
Signed-off-by: Mike Rapoport (IBM)