linux-stable.git/arch/x86, branch linux-6.3.y

x86/efi: Make efi_set_virtual_address_map IBT safe

2023-07-11T17:39:51+00:00

[ Upstream commit 0303c9729afc4094ef53e552b7b8cff7436028d6 ]

Niklāvs reported a boot regression on an Alderlake machine and bisected it
to commit 9df9d2f0471b ("init: Invoke arch_cpu_finalize_init() earlier").

By moving the invocation of arch_cpu_finalize_init() further down he
identified that efi_enter_virtual_mode() is the function which causes the
boot hang.

The main difference of the earlier invocation is that the boot CPU is
already fully initialized and mitigations and alternatives are applied.

But the only really interesting change turned out to be IBT, which is now
enabled before efi_enter_virtual_mode(). "ibt=off" on the kernel command
line cured the problem.

Inspection of the involved calls in efi_enter_virtual_mode() unearthed that
efi_set_virtual_address_map() is the only place in the kernel which invokes
an EFI call without the IBT safe wrapper. This went obviously unnoticed so
far as IBT was enabled later.

Use arch_efi_call_virt() instead of efi_call() to cure that.

Fixes: fe379fa4d199 ("x86/ibt: Disable IBT around firmware")
Fixes: 9df9d2f0471b ("init: Invoke arch_cpu_finalize_init() earlier")
Reported-by: Niklāvs Koļesņikovs 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ard Biesheuvel 
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217602
Link: https://lore.kernel.org/r/87jzvm12q0.ffs@tglx
Signed-off-by: Sasha Levin

x86/mm: Fix __swp_entry_to_pte() for Xen PV guests

2023-07-11T17:39:26+00:00

[ Upstream commit 0f88130e8a6fd185b0aeb5d8e286083735f2585a ]

Normally __swp_entry_to_pte() is never called with a value translating
to a valid PTE. The only known exception is pte_swap_tests(), resulting
in a WARN splat in Xen PV guests, as __pte_to_swp_entry() did
translate the PFN of the valid PTE to a guest local PFN, while
__swp_entry_to_pte() doesn't do the opposite translation.

Fix that by using __pte() in __swp_entry_to_pte() instead of open
coding the native variant of it.

For correctness do the similar conversion for __swp_entry_to_pmd().

Fixes: 05289402d717 ("mm/debug_vm_pgtable: add tests validating arch helpers for core MM features")
Signed-off-by: Juergen Gross 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/r/20230306123259.12461-1-jgross@suse.com
Signed-off-by: Sasha Levin

perf/ibs: Fix interface via core pmu events

2023-07-11T17:39:25+00:00

[ Upstream commit 2fad201fe38ff9a692acedb1990ece2c52a29f95 ]

Although, IBS pmus can be invoked via their own interface, indirect
IBS invocation via core pmu events is also supported with fixed set
of events: cpu-cycles:p, r076:p (same as cpu-cycles:p) and r0C1:p
(micro-ops) for user convenience.

This indirect IBS invocation is broken since commit 66d258c5b048
("perf/core: Optimize perf_init_event()"), which added RAW pmu under
'pmu_idr' list and thus if event_init() fails with RAW pmu, it started
returning error instead of trying other pmus.

Forward precise events from core pmu to IBS by overwriting 'type' and
'config' in the kernel copy of perf_event_attr. Overwriting will cause
perf_init_event() to retry with updated 'type' and 'config', which will
automatically forward event to IBS pmu.

Without patch:
  $ sudo ./perf record -C 0 -e r076:p -- sleep 1
  Error:
  The r076:p event is not supported.

With patch:
  $ sudo ./perf record -C 0 -e r076:p -- sleep 1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.341 MB perf.data (37 samples) ]

Fixes: 66d258c5b048 ("perf/core: Optimize perf_init_event()")
Reported-by: Stephane Eranian 
Signed-off-by: Ravi Bangoria 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20230504110003.2548-3-ravi.bangoria@amd.com
Signed-off-by: Sasha Levin

x86/xen: Set MTRR state when running as Xen PV initial domain

2023-07-11T17:39:25+00:00

[ Upstream commit a153f254e5cdf8fa3a1df90a6ffed3063fede154 ]

When running as Xen PV initial domain (aka dom0), MTRRs are disabled
by the hypervisor, but the system should nevertheless use correct
cache memory types. This has always kind of worked, as disabled MTRRs
resulted in disabled PAT, too, so that the kernel avoided code paths
resulting in inconsistencies. This bypassed all of the sanity checks
the kernel is doing with enabled MTRRs in order to avoid memory
mappings with conflicting memory types.

This has been changed recently, leading to PAT being accepted to be
enabled, while MTRRs stayed disabled. The result is that
mtrr_type_lookup() no longer is accepting all memory type requests,
but started to return WB even if UC- was requested. This led to
driver failures during initialization of some devices.

In reality MTRRs are still in effect, but they are under complete
control of the Xen hypervisor. It is possible, however, to retrieve
the MTRR settings from the hypervisor.

In order to fix those problems, overwrite the MTRR state via
mtrr_overwrite_state() with the MTRR data from the hypervisor, if the
system is running as a Xen dom0.

Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen")
Signed-off-by: Juergen Gross
Signed-off-by: Borislav Petkov (AMD)
Reviewed-by: Boris Ostrovsky
Tested-by: Michael Kelley
Link: https://lore.kernel.org/r/20230502120931.20719-6-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD)
Signed-off-by: Sasha Levin

x86/mtrr: Support setting MTRR state for software defined MTRRs

2023-07-11T17:39:25+00:00

[ Upstream commit 29055dc74287467bd7a053d60b4afe753832960d ]

When running virtualized, MTRR access can be reduced (e.g. in Xen PV
guests or when running as a SEV-SNP guest under Hyper-V). Typically, the
hypervisor will not advertize the MTRR feature in CPUID data, resulting
in no MTRR memory type information being available for the kernel.

This has turned out to result in problems (Link tags below):

- Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
- Xen PV dom0 mapping memory as WB which should be UC- instead

Solve those problems by allowing an MTRR static state override,
overwriting the empty state used today. In case such a state has been
set, don't call get_mtrr_state() in mtrr_bp_init().

The set state will only be used by mtrr_type_lookup(), as in all other
cases mtrr_enabled() is being checked, which will return false. Accept
the overwrite call only for selected cases when running as a guest.
Disable X86_FEATURE_MTRR in order to avoid any MTRR modifications by
just refusing them.

  [ bp: Massage. ]

Signed-off-by: Juergen Gross 
Signed-off-by: Borislav Petkov (AMD) 
Tested-by: Michael Kelley 
Link: https://lore.kernel.org/all/4fe9541e-4d4c-2b2a-f8c8-2d34a7284930@nerdbynature.de/
Link: https://lore.kernel.org/lkml/BYAPR21MB16883ABC186566BD4D2A1451D7FE9@BYAPR21MB1688.namprd21.prod.outlook.com
Signed-off-by: Borislav Petkov (AMD) 
Stable-dep-of: a153f254e5cd ("x86/xen: Set MTRR state when running as Xen PV initial domain")
Signed-off-by: Sasha Levin

x86/mtrr: Replace size_or_mask and size_and_mask with a much easier concept

2023-07-11T17:39:25+00:00

[ Upstream commit d053b481a5f16dbd4f020c6b3ebdf9173fdef0e2 ]

Replace size_or_mask and size_and_mask with the much easier concept of
high reserved bits.

While at it, instead of using constants in the MTRR code, use some new

  [ bp:
   - Drop mtrr_set_mask()
   - Unbreak long lines
   - Move struct mtrr_state_type out of the uapi header as it doesn't
     belong there. It also fixes a HDRTEST breakage "unknown type name ‘bool’"
     as Reported-by: kernel test robot 
   - Massage.
  ]

Signed-off-by: Juergen Gross 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/r/20230502120931.20719-3-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) 
Stable-dep-of: a153f254e5cd ("x86/xen: Set MTRR state when running as Xen PV initial domain")
Signed-off-by: Sasha Levin

x86/mtrr: Remove physical address size calculation

2023-07-11T17:39:25+00:00

[ Upstream commit f6b980646b93a8c585b4ed991b8a34e8fc6ef847 ]

The physical address width calculation in mtrr_bp_init() can easily be
replaced with using the already available value x86_phys_bits from
struct cpuinfo_x86.

The same information source can be used in mtrr/cleanup.c, removing the
need to pass that value on to mtrr_cleanup().

In print_mtrr_state() use x86_phys_bits instead of recalculating it
from size_or_mask.

Move setting of size_or_mask and size_and_mask into a dedicated new
function in mtrr/generic.c, enabling to make those 2 variables static,
as they are used in generic.c only now.

Signed-off-by: Juergen Gross 
Signed-off-by: Borislav Petkov (AMD) 
Tested-by: Michael Kelley 
Link: https://lore.kernel.org/r/20230502120931.20719-2-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) 
Stable-dep-of: a153f254e5cd ("x86/xen: Set MTRR state when running as Xen PV initial domain")
Signed-off-by: Sasha Levin

x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()

2023-07-11T17:39:24+00:00

[ Upstream commit 195edce08b63d293377f615f4f7f086715d2d212 ]

tl;dr: There is a race in the TDX private<=>shared conversion code
       which could kill the TDX guest.  Fix it by changing conversion
       ordering to eliminate the window.

TDX hardware maintains metadata to track which pages are private and
shared. Additionally, TDX guests use the guest x86 page tables to
specify whether a given mapping is intended to be private or shared.
Bad things happen when the intent and metadata do not match.

So there are two thing in play:
 1. "the page" -- the physical TDX page metadata
 2. "the mapping" -- the guest-controlled x86 page table intent

For instance, an unrecoverable exit to VMM occurs if a guest touches a
private mapping that points to a shared physical page.

In summary:
	* Private mapping => Private Page == OK (obviously)
	* Shared mapping  => Shared Page  == OK (obviously)
	* Private mapping => Shared Page  == BIG BOOM!
	* Shared mapping  => Private Page == OK-ish
	  (It will read generate a recoverable #VE via handle_mmio())

Enter load_unaligned_zeropad(). It can touch memory that is adjacent but
otherwise unrelated to the memory it needs to touch. It will cause one
of those unrecoverable exits (aka. BIG BOOM) if it blunders into a
shared mapping pointing to a private page.

This is a problem when __set_memory_enc_pgtable() converts pages from
shared to private. It first changes the mapping and second modifies
the TDX page metadata.  It's moving from:

        * Shared mapping  => Shared Page  == OK
to:
        * Private mapping => Shared Page  == BIG BOOM!

This means that there is a window with a shared mapping pointing to a
private page where load_unaligned_zeropad() can strike.

Add a TDX handler for guest.enc_status_change_prepare(). This converts
the page from shared to private *before* the page becomes private.  This
ensures that there is never a private mapping to a shared page.

Leave a guest.enc_status_change_finish() in place but only use it for
private=>shared conversions.  This will delay updating the TDX metadata
marking the page private until *after* the mapping matches the metadata.
This also ensures that there is never a private mapping to a shared page.

[ dhansen: rewrite changelog ]

Fixes: 7dbde7631629 ("x86/mm/cpa: Add support for TDX shared memory")
Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Dave Hansen 
Reviewed-by: Kuppuswamy Sathyanarayanan 
Link: https://lore.kernel.org/all/20230606095622.1939-3-kirill.shutemov%40linux.intel.com
Signed-off-by: Sasha Levin

x86/mm: Allow guest.enc_status_change_prepare() to fail

2023-07-11T17:39:23+00:00

[ Upstream commit 3f6819dd192ef4f0c568ec3e9d6d408b3fa1ad3d ]

TDX code is going to provide guest.enc_status_change_prepare() that is
able to fail. TDX will use the call to convert the GPA range from shared
to private. This operation can fail.

Add a way to return an error from the callback.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Dave Hansen 
Reviewed-by: Kuppuswamy Sathyanarayanan 
Link: https://lore.kernel.org/all/20230606095622.1939-2-kirill.shutemov%40linux.intel.com
Stable-dep-of: 195edce08b63 ("x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()")
Signed-off-by: Sasha Levin

x86/sev: Fix calculation of end address based on number of pages

2023-07-11T17:39:21+00:00

[ Upstream commit 5dee19b6b2b194216919b99a1f5af2949a754016 ]

When calculating an end address based on an unsigned int number of pages,
any value greater than or equal to 0x100000 that is shift PAGE_SHIFT bits
results in a 0 value, resulting in an invalid end address. Change the
number of pages variable in various routines from an unsigned int to an
unsigned long to calculate the end address correctly.

Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes")
Fixes: dc3f3d2474b8 ("x86/mm: Validate memory when changing the C-bit")
Signed-off-by: Tom Lendacky 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/r/6a6e4eea0e1414402bac747744984fa4e9c01bb6.1686063086.git.thomas.lendacky@amd.com
Signed-off-by: Sasha Levin