linux.git/arch/x86/boot, branch v6.17

Merge tag 'x86_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2025-08-17T13:53:15+00:00

Pull x86 fixes from Borislav Petkov:

 - Remove a transitional asm/cpuid.h header which was added only as a
   fallback during cpuid helpers reorg

 - Initialize reserved fields in the SVSM page validation calls
   structure to zero in order to allow for future structure extensions

 - Have the sev-guest driver's buffers used in encryption operations be
   in linear mapping space as the encryption operation can be offloaded
   to an accelerator

 - Have a read-only MSR write when in an AMD SNP guest trap to the
   hypervisor as it is usually done. This makes the guest user
   experience better by simply raising a #GP instead of terminating said
   guest

 - Do not output AVX512 elapsed time for kernel threads because the data
   is wrong and fix a NULL pointer dereferencing in the process

 - Adjust the SRSO mitigation selection to the new attack vectors

* tag 'x86_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpuid: Remove transitional  header
  x86/sev: Ensure SVSM reserved fields in a page validation entry are initialized to zero
  virt: sev-guest: Satisfy linear mapping requirement in get_derived_key()
  x86/sev: Improve handling of writes to intercepted TSC MSRs
  x86/fpu: Fix NULL dereference in avx512_status()
  x86/bugs: Select best SRSO mitigation

x86/sev: Ensure SVSM reserved fields in a page validation entry are initialized to zero

2025-08-15T15:06:17+00:00

In order to support future versions of the SVSM_CORE_PVALIDATE call, all
reserved fields within a PVALIDATE entry must be set to zero as an SVSM should
be ensuring all reserved fields are zero in order to support future usage of
reserved areas based on the protocol version.

Fixes: fcd042e86422 ("x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0")
Signed-off-by: Tom Lendacky 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Joerg Roedel 
Cc: 
Link: https://lore.kernel.org/7cde412f8b057ea13a646fb166b1ca023f6a5031.1755098819.git.thomas.lendacky@amd.com

x86/sev: Evict cache lines during SNP memory validation

2025-08-06T17:17:22+00:00

An SNP cache coherency vulnerability requires a cache line eviction
mitigation when validating memory after a page state change to private.
The specific mitigation is to touch the first and last byte of each 4K
page that is being validated. There is no need to perform the mitigation
when performing a page state change to shared and rescinding validation.

CPUID bit Fn8000001F_EBX[31] defines the COHERENCY_SFW_NO CPUID bit
that, when set, indicates that the software mitigation for this
vulnerability is not needed.

Implement the mitigation and invoke it when validating memory (making it
private) and the COHERENCY_SFW_NO bit is not set, indicating the SNP
guest is vulnerable.

Co-developed-by: Michael Roth 
Signed-off-by: Michael Roth 
Signed-off-by: Tom Lendacky 
Signed-off-by: Borislav Petkov (AMD) 
Acked-by: Thomas Gleixner

x86/efi: Implement support for embedding SBAT data for x86

2025-06-21T11:53:44+00:00

Similar to zboot architectures, implement support for embedding SBAT data
for x86. Put '.sbat' section in between '.data' and '.text' as the former
also covers '.bss' and '.pgtable' and thus must be the last one in the
file.

Signed-off-by: Vitaly Kuznetsov 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Ard Biesheuvel 
Link: https://lore.kernel.org/20250603091951.57775-1-vkuznets@redhat.com

Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2025-05-31T22:44:16+00:00

Pull MM updates from Andrew Morton:

- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.

- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.

- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.

- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invokation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.

- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleaups and a small efficiency improvement to
this part of our swap handling code.

- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscalls arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value.

This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.

- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.

- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.

- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.

- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.

- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.

This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecesary work, but this is fixed up where
it is anticipated to occur.

- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.

- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.

- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.

Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.

- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.

stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramaticelly
reduced.

- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.

- ""Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.

- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.

- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.

- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible. because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to violated.

This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.

- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in in the huge page splitting and migrating code.

- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.

- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" argument to the "swappiness=" argument
for memory.reclaim MGLRU's lru_gen.

This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.

- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.

- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range. By skipping
ranges of invalid pfns.

- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned a single NUMA mode.

Dramatic performance benefits were seen in some real world cases.

- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.

- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.

- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.

- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.

- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().

The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.

- "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.

This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.

- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.

- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.

- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...

Merge tag 'efi-next-for-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

2025-05-30T19:42:57+00:00

Pull EFI updates from Ard Biesheuvel:
 "Not a lot going on in the EFI tree this cycle. The only thing that
  stands out is the new support for SBAT metadata, which was a bit
  contentious when it was first proposed, because in the initial
  incarnation, it would have required us to maintain a revocation index,
  and bump it each time a vulnerability affecting UEFI secure boot got
  fixed. This was shot down for obvious reasons.

  This time, only the changes needed to emit the SBAT section into the
  PE/COFF image are being carried upstream, and it is up to the distros
  to decide what to put in there when creating and signing the build.

  This only has the EFI zboot bits (which the distros will be using for
  arm64); the x86 bzImage changes should be arriving next cycle,
  presumably via the -tip tree.

  Summary:

   - Add support for emitting a .sbat section into the EFI zboot image,
     so that downstreams can easily include revocation metadata in the
     signed EFI images

   - Align PE symbolic constant names with other projects

   - Bug fix for the efi_test module

   - Log the physical address and size of the EFI memory map when
     failing to map it

   - A kerneldoc fix for the EFI stub code"

* tag 'efi-next-for-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  include: pe.h: Fix PE definitions
  efi/efi_test: Fix missing pending status update in getwakeuptime
  efi: zboot specific mechanism for embedding SBAT section
  efi/libstub: Describe missing 'out' parameter in efi_load_initrd
  efi: Improve logging around memmap init

include: pe.h: Fix PE definitions

2025-05-21T14:46:37+00:00

* Rename constants to their standard PE names:
  - MZ_MAGIC -> IMAGE_DOS_SIGNATURE
  - PE_MAGIC -> IMAGE_NT_SIGNATURE
  - PE_OPT_MAGIC_PE32_ROM -> IMAGE_ROM_OPTIONAL_HDR_MAGIC
  - PE_OPT_MAGIC_PE32 -> IMAGE_NT_OPTIONAL_HDR32_MAGIC
  - PE_OPT_MAGIC_PE32PLUS -> IMAGE_NT_OPTIONAL_HDR64_MAGIC
  - IMAGE_DLL_CHARACTERISTICS_NX_COMPAT -> IMAGE_DLLCHARACTERISTICS_NX_COMPAT

* Import constants and their description from readpe and file projects
  which contains current up-to-date information:
  - IMAGE_FILE_MACHINE_*
  - IMAGE_FILE_*
  - IMAGE_SUBSYSTEM_*
  - IMAGE_DLLCHARACTERISTICS_*
  - IMAGE_DLLCHARACTERISTICS_EX_*
  - IMAGE_DEBUG_TYPE_*

* Add missing IMAGE_SCN_* constants and update their incorrect description

* Fix incorrect value of IMAGE_SCN_MEM_PURGEABLE constant

* Add description for win32_version and loader_flags PE fields

Signed-off-by: Pali Rohár 
Signed-off-by: Ard Biesheuvel

x86/mm/64: Make 5-level paging support unconditional

2025-05-17T08:38:16+00:00

Both Intel and AMD CPUs support 5-level paging, which is expected to
become more widely adopted in the future. All major x86 Linux
distributions have the feature enabled.

Remove CONFIG_X86_5LEVEL and related #ifdeffery for it to make it more readable.

Suggested-by: Borislav Petkov 
Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Ingo Molnar 
Reviewed-by: Ard Biesheuvel 
Reviewed-by: Borislav Petkov (AMD) 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Link: https://lore.kernel.org/r/20250516123306.3812286-4-kirill.shutemov@linux.intel.com

x86/cpuid: Set as the main CPUID header

2025-05-15T16:23:55+00:00

The main CPUID header  was originally a storefront for the
headers:

    
    

Now that the latter CPUID(0x2) header has been merged into the former,
there is no practical difference between  and
.

Migrate all users to the  header, in preparation of
the removal of .

Don't remove  just yet, in case some new code in -next
started using it.

Suggested-by: Ingo Molnar 
Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Ingo Molnar 
Cc: Andrew Cooper 
Cc: H. Peter Anvin 
Cc: John Ogness 
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250508150240.172915-3-darwi@linutronix.de

x86/boot: Defer initialization of VM space related global variables

2025-05-14T08:06:35+00:00

The global pseudo-constants 'page_offset_base', 'vmalloc_base' and
'vmemmap_base' are not used extremely early during the boot, and cannot be
used safely until after the KASLR memory randomization code in
kernel_randomize_memory() executes, which may update their values.

So there is no point in setting these variables extremely early, and it
can wait until after the kernel itself is mapped and running from its
permanent virtual mapping.

Signed-off-by: Ard Biesheuvel 
Signed-off-by: Ingo Molnar 
Cc: Linus Torvalds 
Link: https://lore.kernel.org/r/20250513111157.717727-9-ardb+git@google.com