linux.git/arch/x86/kernel/setup.c, branch v6.12

x86/fred: Enable FRED right after init_mem_mapping()

2024-08-13T19:59:21+00:00

On 64-bit init_mem_mapping() relies on the minimal page fault handler
provided by the early IDT mechanism. The real page fault handler is
installed right afterwards into the IDT.

This is problematic on CPUs which have X86_FEATURE_FRED set because the
real page fault handler retrieves the faulting address from the FRED
exception stack frame and not from CR2, but that does obviously not work
when FRED is not yet enabled in the CPU.

To prevent this enable FRED right after init_mem_mapping() without
interrupt stacks. Those are enabled later in trap_init() after the CPU
entry area is set up.

[ tglx: Encapsulate the FRED details ]

Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Hou Wenlong 
Suggested-by: Thomas Gleixner 
Signed-off-by: Xin Li (Intel) 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/all/20240709154048.3543361-4-xin@zytor.com

x86/setup: Parse the builtin command line before merging

2024-07-31T19:46:35+00:00

Commit in Fixes was added as a catch-all for cases where the cmdline is
parsed before being merged with the builtin one.

And promptly one issue appeared, see Link below. The microcode loader
really needs to parse it that early, but the merging happens later.

Reshuffling the early boot nightmare^W code to handle that properly would
be a painful exercise for another day so do the chicken thing and parse the
builtin cmdline too before it has been merged.

Fixes: 0c40b1c7a897 ("x86/setup: Warn when option parsing is done too early")
Reported-by: Mike Lothian 
Signed-off-by: Borislav Petkov (AMD) 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Thomas Gleixner 
Link: https://lore.kernel.org/all/20240730152108.GAZqkE5Dfi9AuKllRw@fat_crate.local
Link: https://lore.kernel.org/r/20240722152330.GCZp55ck8E_FT4kPnC@fat_crate.local

Merge tag 'efi-next-for-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

2024-07-16T19:22:07+00:00

Pull EFI updates from Ard Biesheuvel:
 "Note the removal of the EFI fake memory map support - this is believed
  to be unused and no longer worth supporting. However, we could easily
  bring it back if needed.

  With recent developments regarding confidential VMs and unaccepted
  memory, combined with kexec, creating a known inaccurate view of the
  firmware's memory map and handing it to the OS is a feature we can
  live without, hence the removal. Alternatively, I could imagine making
  this feature mutually exclusive with those confidential VM related
  features, but let's try simply removing it first.

  Summary:

   - Drop support for the 'fake' EFI memory map on x86

   - Add an SMBIOS based tweak to the EFI stub instructing the firmware
     on x86 Macbook Pros to keep both GPUs enabled

   - Replace 0-sized array with flexible array in EFI memory attributes
     table handling

   - Drop redundant BSS clearing when booting via the native PE
     entrypoint on x86

   - Avoid returning EFI_SUCCESS when aborting on an out-of-memory
     condition

   - Cosmetic tweak for arm64 KASLR loading logic"

* tag 'efi-next-for-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: Replace efi_memory_attributes_table_t 0-sized array with flexible array
  efi: Rename efi_early_memdesc_ptr() to efi_memdesc_ptr()
  arm64/efistub: Clean up KASLR logic
  x86/efistub: Drop redundant clearing of BSS
  x86/efistub: Avoid returning EFI_SUCCESS on error
  x86/efistub: Call Apple set_os protocol on dual GPU Intel Macs
  x86/efistub: Enable SMBIOS protocol handling for x86
  efistub/smbios: Simplify SMBIOS enumeration API
  x86/efi: Drop support for fake EFI memory maps

x86/efi: Drop support for fake EFI memory maps

2024-07-01T22:26:24+00:00

Between kexec and confidential VM support, handling the EFI memory maps
correctly on x86 is already proving to be rather difficult (as opposed
to other EFI architectures which manage to never modify the EFI memory
map to begin with)

EFI fake memory map support is essentially a development hack (for
testing new support for the 'special purpose' and 'more reliable' EFI
memory attributes) that leaked into production code. The regions marked
in this manner are not actually recognized as such by the firmware
itself or the EFI stub (and never have), and marking memory as 'more
reliable' seems rather futile if the underlying memory is just ordinary
RAM.

Marking memory as 'special purpose' in this way is also dubious, but may
be in use in production code nonetheless. However, the same should be
achievable by using the memmap= command line option with the ! operator.

EFI fake memmap support is not enabled by any of the major distros
(Debian, Fedora, SUSE, Ubuntu) and does not exist on other
architectures, so let's drop support for it.

Acked-by: Borislav Petkov (AMD) 
Acked-by: Dan Williams 
Signed-off-by: Ard Biesheuvel

x86/setup: Warn when option parsing is done too early

2024-05-27T16:54:45+00:00

Commit

  4faa0e5d6d79 ("x86/boot: Move kernel cmdline setup earlier in the boot process (again)")

fixed and issue where cmdline parsing would happen before the final
boot_command_line string has been built from the builtin and boot
cmdlines and thus cmdline arguments would get lost.

Add a check to catch any future wrong use ordering so that such issues
can be caught in time.

Signed-off-by: Borislav Petkov (AMD) 
Acked-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20240409152541.GCZhVd9XIPXyTNd9vc@fat_crate.local

Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2024-05-19T16:21:03+00:00

Pull mm updates from Andrew Morton:
"The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.
Notable series include:

- Lucas Stach has provided some page-mapping cleanup/consolidation/
maintainability work in the series "mm/treewide: Remove pXd_huge()
API".

- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
one test.

- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being
allocated: number of calls and amount of memory.

- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in
largely similar code sites.

- In his series "mm: page_alloc: freelist migratetype hygiene"
Johannes Weiner has fixed the page allocator's handling of
migratetype requests, with resulting improvements in compaction
efficiency.

- In the series "make the hugetlb migration strategy consistent"
Baolin Wang has fixed a hugetlb migration issue, which should
improve hugetlb allocation reliability.

- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when
memory almost met memcg limit".

- In the series "mm/filemap: optimize folio adding and splitting"
Kairui Song has optimized pagecache insertion, yielding ~10%
performance improvement in one test.

- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".

- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".

- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".

- More cleanups from Matthew Wilcox in the series "Various
page->flags cleanups".

- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".

- More folio conversions and cleanups in Matthew Wilcox's series:
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"

- David Hildenbrand chipped in with "fs/proc/task_mmu: convert
hugetlb functions to work on folis".

- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".

- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
series "mm/ksm: fix ksm exec support for prctl".

- Baolin Wang has implemented NUMA balancing for multi-size THPs.
This is a simple first-cut implementation for now. The series is
"support multi-size THP numa balancing".

- Cleanups to vma handling helper functions from Matthew Wilcox in
the series "Unify vma_address and vma_pgoff_address".

- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".

- Improvements to the swapping of multi-size THPs from Ryan Roberts
in the series "Swap-out mTHP without splitting".

- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

- GUP cleanups from David Hildenbrand in "mm/gup: consistently call
it GUP-fast".

- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
path to use struct vm_fault".

- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".

- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes".
Fixes the initialization code so that migration between different
memory types works as intended.

- David Hildenbrand has improved follow_pte() and fixed an errant
driver in the series "mm: follow_pte() improvements and acrn
follow_pte() fixes".

- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".

- Folio conversions in KSM in Alex Shi's series "transfer page to
folio in KSM".

- Barry Song has added some sysfs stats for monitoring multi-size
THP's in the series "mm: add per-order mTHP alloc and swpout
counters".

- Some zswap cleanups from Yosry Ahmed in the series "zswap
same-filled and limit checking cleanups".

- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".

- Multi-size THPs get more work, this time from Lance Yang. His
series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
optimizes the freeing of these things.

- Kemeng Shi has added more userspace-visible writeback
instrumentation in the series "Improve visibility of writeback".

- Kemeng Shi then sent some maintenance work on top in the series
"Fix and cleanups to page-writeback".

- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
the series "Improve anon_vma scalability for anon VMAs". Intel's
test bot reported an improbable 3x improvement in one test.

- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"

- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"

- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as
XFAIL".

- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".

- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
selftests: cgroup: add tests to verify the zswap writeback path
mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
mm/damon/core: fix return value from damos_wmark_metric_value
mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
selftests: cgroup: remove redundant enabling of memory controller
Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
Docs/mm/damon/design: use a list for supported filters
Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
selftests/damon: classify tests for functionalities and regressions
selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
selftests/damon: add a test for DAMOS quota goal
...

Merge tag 'x86-cpu-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-05-14T01:44:44+00:00

Pull x86 cpu updates from Ingo Molnar:

 - Rework the x86 CPU vendor/family/model code: introduce the 'VFM'
   value that is an 8+8+8 bit concatenation of the vendor/family/model
   value, and add macros that work on VFM values. This simplifies the
   addition of new Intel models & families, and simplifies existing
   enumeration & quirk code.

 - Add support for the AMD 0x80000026 leaf, to better parse topology
   information

 - Optimize the NUMA allocation layout of more per-CPU data structures

 - Improve the workaround for AMD erratum 1386

 - Clear TME from /proc/cpuinfo as well, when disabled by the firmware

 - Improve x86 self-tests

 - Extend the mce_record tracepoint with the ::ppin and ::microcode fields

 - Implement recovery for MCE errors in TDX/SEAM non-root mode

 - Misc cleanups and fixes

* tag 'x86-cpu-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  x86/mm: Switch to new Intel CPU model defines
  x86/tsc_msr: Switch to new Intel CPU model defines
  x86/tsc: Switch to new Intel CPU model defines
  x86/cpu: Switch to new Intel CPU model defines
  x86/resctrl: Switch to new Intel CPU model defines
  x86/microcode/intel: Switch to new Intel CPU model defines
  x86/mce: Switch to new Intel CPU model defines
  x86/cpu: Switch to new Intel CPU model defines
  x86/cpu/intel_epb: Switch to new Intel CPU model defines
  x86/aperfmperf: Switch to new Intel CPU model defines
  x86/apic: Switch to new Intel CPU model defines
  perf/x86/msr: Switch to new Intel CPU model defines
  perf/x86/intel/uncore: Switch to new Intel CPU model defines
  perf/x86/intel/pt: Switch to new Intel CPU model defines
  perf/x86/lbr: Switch to new Intel CPU model defines
  perf/x86/intel/cstate: Switch to new Intel CPU model defines
  x86/bugs: Switch to new Intel CPU model defines
  x86/bugs: Switch to new Intel CPU model defines
  x86/cpu/vfm: Update arch/x86/include/asm/intel-family.h
  x86/cpu/vfm: Add new macros to work with (vendor/family/model) values
  ...

x86: remove unneeded memblock_find_dma_reserve()

2024-04-26T03:56:10+00:00

Patch series "mm/mm_init.c: refactor free_area_init_core()".

In function free_area_init_core(), the code calculating
zone->managed_pages and the subtracting dma_reserve from DMA zone looks
very confusing.

From git history, the code calculating zone->managed_pages was for
zone->present_pages originally.  The early rough assignment is for
optimize zone's pcp and water mark setting.  Later, managed_pages was
introduced into zone to represent the number of managed pages by buddy. 
Now, zone->managed_pages is zeroed out and reset in mem_init() when
calling memblock_free_all().  zone's pcp and wmark setting relying on
actual zone->managed_pages are done later than mem_init() invocation.  So
we don't need rush to early calculate and set zone->managed_pages, just
set it as zone->present_pages, will adjust it in mem_init().

And also add a new function calc_nr_kernel_pages() to count up free but
not reserved pages in memblock, then assign it to nr_all_pages and
nr_kernel_pages after memmap pages are allocated.


This patch (of 6):

Variable dma_reserve and its usage was introduced in commit 0e0b864e069c
("[PATCH] Account for memmap and optionally the kernel image as holes"). 
Its original purpose was to accounting for the reserved pages in DMA zone
to make DMA zone's watermarks calculation more accurate on x86.

However, currently there's zone->managed_pages to account for all
available pages for buddy, zone->present_pages to account for all present
physical pages in zone.  What is more important, on x86, calculating and
setting the zone->managed_pages is a temporary move, all zone's
managed_pages will be zeroed out and reset to the actual value according
to how many pages are added to buddy allocator in mem_init().  Before
mem_init(), no buddy alloction is requested.  And zone's pcp and watermark
setting are all done after mem_init().  So, no need to worry about the DMA
zone's setting accuracy during free_area_init().

Hence, remove memblock_find_dma_reserve() to stop calculating and
setting dma_reserve.

Link: https://lkml.kernel.org/r/20240325145646.1044760-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20240325145646.1044760-2-bhe@redhat.com
Signed-off-by: Baoquan He 
Reviewed-by: Mike Rapoport (IBM) 
Signed-off-by: Andrew Morton

Merge tag 'v6.9-rc3' into x86/boot, to pick up fixes before queueing up more changes

2024-04-11T13:36:23+00:00

Signed-off-by: Ingo Molnar

Merge tag 'v6.9-rc3' into x86/cpu, to pick up fixes

2024-04-09T07:28:41+00:00

Signed-off-by: Ingo Molnar