linux.git/arch/x86/include/asm/shared, branch v6.14

x86/tdx: Dump attributes and TD_CTLS on boot

2024-12-05T18:27:07+00:00

Dump TD configuration on boot. Attributes and TD_CTLS define TD
behavior. This information is useful for tracking down bugs.

The output ends up looking like this in practice:

[    0.000000] tdx: Guest detected
[    0.000000] tdx: Attributes: SEPT_VE_DISABLE
[    0.000000] tdx: TD_CTLS: PENDING_VE_DISABLE ENUM_TOPOLOGY VIRT_CPUID2 REDUCE_VE

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Dave Hansen 
Reviewed-by: Nikolay Borisov 
Link: https://lore.kernel.org/all/20241202072458.447455-1-kirill.shutemov%40linux.intel.com

x86/tdx: Disable unnecessary virtualization exceptions

2024-12-04T21:55:15+00:00

Originally, #VE was defined as the TDX behavior in order to support
paravirtualization of x86 features that can’t be virtualized by the TDX
module. The intention is that if guest software wishes to use such a
feature, it implements some logic to support this. This logic resides in
the #VE exception handler it may work in cooperation with the host VMM.

Theoretically, the guest TD’s #VE handler was supposed to act as a "TDX
enlightenment agent" inside the TD. However, in practice, the #VE
handler is simplistic:

- #VE on CPUID is handled by returning all-0 to the code which
executed CPUID. In many cases, an all-0 value is not the correct
value, and may cause improper operation.

- #VE on RDMSR is handled by requesting the MSR value from the host
VMM. This is prone to security issues since the host VMM is
untrusted. It may also be functionally incorrect in case the
expected operation is to paravirtualize some CPU functionality.

Newer TDX modules provide a "REDUCE_VE" feature. When enabled, it
drastically cuts cases when guests receive #VE on MSR and CPUID
accesses. Basically, instead of punting the problem to the VMM, the
TDX module fills in good data. What the TDX module provides is
obviously highly specific to the MSR or CPUID. This is all spelled
out in excruciating detail in the TDX specs.

Enable REDUCE_VE. Make TDX guest behaviour less odd, and closer to
how a normal CPU behaves.

Note that enabling of the feature doesn't eliminate need in #VE handler
for CPUID and MSR accesses. Some MSRs still generate #VE (notably
APIC-related) and kernel needs CPUID #VE handler to ask VMM for leafs in
hypervisor range.

[ dhansen: changelog tweaks, rename/rework VE reduction function ]

Signed-off-by: Kirill A. Shutemov
Signed-off-by: Dave Hansen
Reviewed-by: Nikolay Borisov
Link: https://lore.kernel.org/all/20241202072431.447380-1-kirill.shutemov%40linux.intel.com

x86/tdx: Enable CPU topology enumeration

2024-11-07T18:27:45+00:00

TDX 1.0 defines baseline behaviour of TDX guest platform. TDX 1.0
generates a #VE when accessing topology-related CPUID leafs (0xB and
0x1F) and the X2APIC_APICID MSR. The kernel returns all zeros on CPUID
topology. In practice, this means that the kernel can only boot with a
plain topology. Any complications will cause problems.

The ENUM_TOPOLOGY feature allows the VMM to provide topology
information to the guest. Enabling the feature eliminates
topology-related #VEs: the TDX module virtualizes accesses to
the CPUID leafs and the MSR.

Enable ENUM_TOPOLOGY if it is available.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Dave Hansen 
Acked-by: Kai Huang 
Link: https://lore.kernel.org/all/20241104103803.195705-5-kirill.shutemov%40linux.intel.com

x86/tdx: Dynamically disable SEPT violations from causing #VEs

2024-11-07T18:27:38+00:00

Memory access #VEs are hard for Linux to handle in contexts like the
entry code or NMIs.  But other OSes need them for functionality.
There's a static (pre-guest-boot) way for a VMM to choose one or the
other.  But VMMs don't always know which OS they are booting, so they
choose to deliver those #VEs so the "other" OSes will work.  That,
unfortunately has left us in the lurch and exposed to these
hard-to-handle #VEs.

The TDX module has introduced a new feature. Even if the static
configuration is set to "send nasty #VEs", the kernel can dynamically
request that they be disabled. Once they are disabled, access to private
memory that is not in the Mapped state in the Secure-EPT (SEPT) will
result in an exit to the VMM rather than injecting a #VE.

Check if the feature is available and disable SEPT #VE if possible.

If the TD is allowed to disable/enable SEPT #VEs, the ATTR_SEPT_VE_DISABLE
attribute is no longer reliable. It reflects the initial state of the
control for the TD, but it will not be updated if someone (e.g. bootloader)
changes it before the kernel starts. Kernel must check TDCS_TD_CTLS bit to
determine if SEPT #VEs are enabled or disabled.

[ dhansen: remove 'return' at end of function ]

Fixes: 373e715e31bf ("x86/tdx: Panic on bad configs that #VE on "private" memory access")
Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Dave Hansen 
Acked-by: Kai Huang 
Link: https://lore.kernel.org/all/20241104103803.195705-4-kirill.shutemov%40linux.intel.com

x86/tdx: Introduce wrappers to read and write TD metadata

2024-11-07T18:26:16+00:00

The TDG_VM_WR TDCALL is used to ask the TDX module to change some
TD-specific VM configuration. There is currently only one user in the
kernel of this TDCALL leaf.  More will be added shortly.

Refactor to make way for more users of TDG_VM_WR who will need to modify
other TD configuration values.

Add a wrapper for the TDG_VM_RD TDCALL that requests TD-specific
metadata from the TDX module. There are currently no users for
TDG_VM_RD. Mark it as __maybe_unused until the first user appears.

This is preparation for enumeration and enabling optional TD features.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Dave Hansen 
Reviewed-by: Kai Huang 
Reviewed-by: Kuppuswamy Sathyanarayanan 
Link: https://lore.kernel.org/all/20241104103803.195705-2-kirill.shutemov%40linux.intel.com

x86/virt/tdx: Get module global metadata for module initialization

2023-12-08T17:12:18+00:00

The TDX module global metadata provides system-wide information about
the module.

TL;DR:

Use the TDH.SYS.RD SEAMCALL to tell if the module is good or not.

Long Version:

1) Only initialize TDX module with version 1.5 and later

TDX module 1.0 has some compatibility issues with the later versions of
module, as documented in the "Intel TDX module ABI incompatibilities
between TDX1.0 and TDX1.5" spec.  Don't bother with module versions that
do not have a stable ABI.

2) Get the essential global metadata for module initialization

TDX reports a list of "Convertible Memory Region" (CMR) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module.  The kernel does this by constructing a list of "TD
Memory Regions" (TDMRs) to cover all these memory regions and passing
them to the TDX module.

Each TDMR is a TDX architectural data structure containing the memory
region that the TDMR covers, plus the information to track (within this
TDMR):
  a) the "Physical Address Metadata Table" (PAMT) to track each TDX
     memory page's status (such as which TDX guest "owns" a given page,
     and
  b) the "reserved areas" to tell memory holes that cannot be used as
     TDX memory.

The kernel needs to get below metadata from the TDX module to build the
list of TDMRs:
  a) the maximum number of supported TDMRs
  b) the maximum number of supported reserved areas per TDMR and,
  c) the PAMT entry size for each TDX-supported page size.

== Implementation ==

The TDX module has two modes of fetching the metadata: a one field at
a time, or all in one blob.  Use the field at a time for now.  It is
slower, but there just are not enough fields now to justify the
complexity of extra unpacking.

The err_free_tdxmem=>out_put_tdxmem goto looks wonky by itself.  But
it is the first of a bunch of error handling that will get stuck at
its site.

[ dhansen: clean up changelog and add a struct to map between
	   the TDX module fields and 'struct tdx_tdmr_sysinfo' ]

Signed-off-by: Kai Huang 
Signed-off-by: Dave Hansen 
Link: https://lore.kernel.org/all/20231208170740.53979-8-dave.hansen%40intel.com

x86/virt/tdx: Define TDX supported page sizes as macros

2023-12-08T17:12:00+00:00

TDX supports 4K, 2M and 1G page sizes.  The corresponding values are
defined by the TDX module spec and used as TDX module ABI.  Currently,
they are used in try_accept_one() when the TDX guest tries to accept a
page.  However currently try_accept_one() uses hard-coded magic values.

Define TDX supported page sizes as macros and get rid of the hard-coded
values in try_accept_one().  TDX host support will need to use them too.

Signed-off-by: Kai Huang 
Signed-off-by: Dave Hansen 
Reviewed-by: Kirill A. Shutemov 
Reviewed-by: Dave Hansen 
Reviewed-by: David Hildenbrand 
Link: https://lore.kernel.org/all/20231208170740.53979-2-dave.hansen%40intel.com

Merge tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux

2023-11-05T01:58:13+00:00

Pull unified attestation reporting from Dan Williams:
 "In an ideal world there would be a cross-vendor standard attestation
  report format for confidential guests along with a common device
  definition to act as the transport.

  In the real world the situation ended up with multiple platform
  vendors inventing their own attestation report formats with the
  SEV-SNP implementation being a first mover to define a custom
  sev-guest character device and corresponding ioctl(). Later, this
  configfs-tsm proposal intercepted an attempt to add a tdx-guest
  character device and a corresponding new ioctl(). It also anticipated
  ARM and RISC-V showing up with more chardevs and more ioctls().

  The proposal takes for granted that Linux tolerates the vendor report
  format differentiation until a standard arrives. From talking with
  folks involved, it sounds like that standardization work is unlikely
  to resolve anytime soon. It also takes the position that kernfs ABIs
  are easier to maintain than ioctl(). The result is a shared configfs
  mechanism to return per-vendor report-blobs with the option to later
  support a standard when that arrives.

  Part of the goal here also is to get the community into the
  "uncomfortable, but beneficial to the long term maintainability of the
  kernel" state of talking to each other about their differentiation and
  opportunities to collaborate. Think of this like the device-driver
  equivalent of the common memory-management infrastructure for
  confidential-computing being built up in KVM.

  As for establishing an "upstream path for cross-vendor
  confidential-computing device driver infrastructure" this is something
  I want to discuss at Plumbers. At present, the multiple vendor
  proposals for assigning devices to confidential computing VMs likely
  needs a new dedicated repository and maintainer team, but that is a
  discussion for v6.8.

  For now, Greg and Thomas have acked this approach and this is passing
  is AMD, Intel, and Google tests.

  Summary:

   - Introduce configfs-tsm as a shared ABI for confidential computing
     attestation reports

   - Convert sev-guest to additionally support configfs-tsm alongside
     its vendor specific ioctl()

   - Added signed attestation report retrieval to the tdx-guest driver
     forgoing a new vendor specific ioctl()

   - Misc cleanups and a new __free() annotation for kvfree()"

* tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux:
  virt: tdx-guest: Add Quote generation support using TSM_REPORTS
  virt: sevguest: Add TSM_REPORTS support for SNP_GET_EXT_REPORT
  mm/slab: Add __free() support for kvfree
  virt: sevguest: Prep for kernel internal get_ext_report()
  configfs-tsm: Introduce a shared ABI for attestation reports
  virt: coco: Add a coco/Makefile and coco/Kconfig
  virt: sevguest: Fix passing a stack buffer as a scatterlist target

virt: tdx-guest: Add Quote generation support using TSM_REPORTS

2023-10-20T01:12:00+00:00

In TDX guest, the attestation process is used to verify the TDX guest
trustworthiness to other entities before provisioning secrets to the
guest. The first step in the attestation process is TDREPORT
generation, which involves getting the guest measurement data in the
format of TDREPORT, which is further used to validate the authenticity
of the TDX guest. TDREPORT by design is integrity-protected and can
only be verified on the local machine.

To support remote verification of the TDREPORT in a SGX-based
attestation, the TDREPORT needs to be sent to the SGX Quoting Enclave
(QE) to convert it to a remotely verifiable Quote. SGX QE by design can
only run outside of the TDX guest (i.e. in a host process or in a
normal VM) and guest can use communication channels like vsock or
TCP/IP to send the TDREPORT to the QE. But for security concerns, the
TDX guest may not support these communication channels. To handle such
cases, TDX defines a GetQuote hypercall which can be used by the guest
to request the host VMM to communicate with the SGX QE. More details
about GetQuote hypercall can be found in TDX Guest-Host Communication
Interface (GHCI) for Intel TDX 1.0, section titled
"TDG.VP.VMCALL".

Trusted Security Module (TSM) [1] exposes a common ABI for Confidential
Computing Guest platforms to get the measurement data via ConfigFS.
Extend the TSM framework and add support to allow an attestation agent
to get the TDX Quote data (included usage example below).

report=/sys/kernel/config/tsm/report/report0
mkdir $report
dd if=/dev/urandom bs=64 count=1 > $report/inblob
hexdump -C $report/outblob
rmdir $report

GetQuote TDVMCALL requires TD guest pass a 4K aligned shared buffer
with TDREPORT data as input, which is further used by the VMM to copy
the TD Quote result after successful Quote generation. To create the
shared buffer, allocate a large enough memory and mark it shared using
set_memory_decrypted() in tdx_guest_init(). This buffer will be re-used
for GetQuote requests in the TDX TSM handler.

Although this method reserves a fixed chunk of memory for GetQuote
requests, such one time allocation can help avoid memory fragmentation
related allocation failures later in the uptime of the guest.

Since the Quote generation process is not time-critical or frequently
used, the current version uses a polling model for Quote requests and
it also does not support parallel GetQuote requests.

Link: https://lore.kernel.org/lkml/169342399185.3934343.3035845348326944519.stgit@dwillia2-xfh.jf.intel.com/ [1]
Signed-off-by: Kuppuswamy Sathyanarayanan
Reviewed-by: Erdem Aktas
Tested-by: Kuppuswamy Sathyanarayanan
Tested-by: Peter Gonda
Reviewed-by: Tom Lendacky
Signed-off-by: Dan Williams

x86/tdx: Fix noreturn build warning around tdx_hypercall_failed()

2023-09-18T07:11:39+00:00

LKP reported below build warning:

  vmlinux.o: warning: objtool: __tdx_hypercall+0x128: __tdx_hypercall_failed() is missing a __noreturn annotation

The __tdx_hypercall_failed() function definition already has __noreturn
annotation, but it turns out the __noreturn must be annotated to the
function declaration.

PeterZ explains:

  "FWIW, the reason being that...

   The point of noreturn is that the caller should know to stop generating
   code. For that the declaration needs the attribute, because call sites
   typically do not have access to the function definition in C."

Add __noreturn annotation to the declaration of __tdx_hypercall_failed()
to fix.  It's not a bad idea to document the __noreturn nature at the
definition site either, so keep the annotation at the definition.

Note  is also included by TDX related assembly files.
Include  only in case of !__ASSEMBLY__
otherwise compiling assembly file would trigger build error.

Also, following the objtool documentation, add __tdx_hypercall_failed()
to "tools/objtool/noreturns.h".

Fixes: c641cfb5c157 ("x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL")
Reported-by: kernel test robot 
Signed-off-by: Kai Huang 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20230918041858.331234-1-kai.huang@intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202309140828.9RdmlH2Z-lkp@intel.com/