summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2026-01-08KVM: SVM: Virtualize and advertise support for ERAPSAmit Shah
AMD CPUs with the Enhanced Return Address Predictor Security (ERAPS) feature (available on Zen5+) obviate the need for FILL_RETURN_BUFFER sequences right after VMEXITs. ERAPS adds guest/host tags to entries in the RSB (a.k.a. RAP). This helps with speculation protection across the VM boundary, and it also preserves host and guest entries in the RSB that can improve software performance (which would otherwise be flushed due to the FILL_RETURN_BUFFER sequences). Importantly, ERAPS also improves cross-domain security by clearing the RAP in certain situations. Specifically, the RAP is cleared in response to actions that are typically tied to software context switching between tasks. Per the APM: The ERAPS feature eliminates the need to execute CALL instructions to clear the return address predictor in most cases. On processors that support ERAPS, return addresses from CALL instructions executed in host mode are not used in guest mode, and vice versa. Additionally, the return address predictor is cleared in all cases when the TLB is implicitly invalidated and in the following cases: • MOV CR3 instruction • INVPCID other than single address invalidation (operation type 0) ERAPS also allows CPUs to extends the size of the RSB/RAP from the older standard (of 32 entries) to a new size, enumerated in CPUID leaf 0x80000021:EBX bits 23:16 (64 entries in Zen5 CPUs). In hardware, ERAPS is always-on, when running in host context, the CPU uses the full RSB/RAP size without any software changes necessary. However, when running in guest context, the CPU utilizes the full size of the RSB/RAP if and only if the new ALLOW_LARGER_RAP flag is set in the VMCB; if the flag is not set, the CPU limits itself to the historical size of 32 entires. Requiring software to opt-in for guest usage of RAPs larger than 32 entries allows hypervisors, i.e. KVM, to emulate the aforementioned conditions in which the RAP is cleared as well as the guest/host split. E.g. if the CPU unconditionally used the full RAP for guests, failure to clear the RAP on transitions between L1 or L2, or on emulated guest TLB flushes, would expose the guest to RAP-based attacks as a guest without support for ERAPS wouldn't know that its FILL_RETURN_BUFFER sequence is insufficient. Address the ~two broad categories of ERAPS emulation, and advertise ERAPS support to userspace, along with the RAP size enumerated in CPUID. 1. Architectural RAP clearing: as above, CPUs with ERAPS clear RAP entries on several conditions, including CR3 updates. To handle scenarios where a relevant operation is handled in common code (emulation of INVPCID and to a lesser extent MOV CR3), piggyback VCPU_EXREG_CR3 and create an alias, VCPU_EXREG_ERAPS. SVM doesn't utilize CR3 dirty tracking, and so for all intents and purposes VCPU_EXREG_CR3 is unused. Aliasing VCPU_EXREG_ERAPS ensures that any flow that writes CR3 will also clear the guest's RAP, and allows common x86 to mark ERAPS vCPUs as needing a RAP clear without having to add a new request (or other mechanism). 2. Nested guests: the ERAPS feature adds host/guest tagging to entries in the RSB, but does not distinguish between the guest ASIDs. To prevent the case of an L2 guest poisoning the RSB to attack the L1 guest, the CPU exposes a new VMCB bit (CLEAR_RAP). The next VMRUN with a VMCB that has this bit set causes the CPU to flush the RSB before entering the guest context. Set the bit in VMCB01 after a nested #VMEXIT to ensure the next time the L1 guest runs, its RSB contents aren't polluted by the L2's contents. Similarly, before entry into a nested guest, set the bit for VMCB02, so that the L1 guest's RSB contents are not leaked/used in the L2 context. Enable ALLOW_LARGER_RAP (and emulate RAP clears) if and only if ERAPS is exposed to the guest. Enabling ALLOW_LARGER_RAP unconditionally wouldn't cause any functional issues, but ignoring userspace's (and L1's) desires would put KVM into a grey area, which is especially undesirable due to the potential security implications. E.g. if a use case wants to have L1 do manual RAP clearing even when ERAPS is present in hardware, enabling ALLOW_LARGER_RAP could result in L1 leaving stale entries in the RAP. ERAPS is documented in AMD APM Vol 2 (Pub 24593), in revisions 3.43 and later. Signed-off-by: Amit Shah <amit.shah@amd.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Amit Shah <amit.shah@amd.com> Link: https://patch.msgid.link/aR913X8EqO6meCqa@google.com
2026-01-08KVM: SVM: Don't allow L1 intercepts for instructions not advertisedKevin Cheng
If a feature is not advertised in the guest's CPUID, prevent L1 from intercepting the unsupported instructions by clearing the corresponding intercept in KVM's cached vmcb12. When an L2 guest executes an instruction that is not advertised to L1, we expect a #UD exception to be injected by L0. However, the nested svm exit handler first checks if the instruction intercept is set in vmcb12, and if so, synthesizes an exit from L2 to L1 instead of a #UD exception. If a feature is not advertised, the L1 intercept should be ignored. While creating KVM's cached vmcb12, sanitize the intercepts for instructions that are not advertised in the guest CPUID. This effectively ignores the L1 intercept on nested vm exit handling. It also ignores the L1 intercept when computing the intercepts in vmcb02, so if L0 (for some reason) does not intercept the instruction, KVM won't intercept it at all. Signed-off-by: Kevin Cheng <chengkev@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251215192510.2300816-1-chengkev@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: SVM: Add support for expedited writes to the fast MMIO busSean Christopherson
Wire up SVM's #NPF handler to fast MMIO. While SVM doesn't provide a dedicated exit reason, it's trivial to key off PFERR_RSVD_MASK. Like VMX, restrict the fast path to L1 to avoid having to deal with nGPA=>GPA translations. For simplicity, use the fast path if and only if the next RIP is known. While KVM could utilize EMULTYPE_SKIP, doing so would require additional logic to deal with SEV guests, e.g. to go down the slow path if the instruction buffer is empty. All modern CPUs support next RIP, and in practice the next RIP will be available for any guest fast path. Copy+paste the kvm_io_bus_write() + trace_kvm_fast_mmio() logic even though KVM would ideally provide a small helper, as such a helper would need to either be a macro or non-inline to avoid including trace.h in a header (trace.h must not be included by x86.c prior to CREATE_TRACE_POINTS being defined). Link: https://patch.msgid.link/20251113221642.1673023-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: SVM: Rename "fault_address" to "gpa" in npf_interception()Sean Christopherson
Rename "fault_address" to "gpa" in KVM's #NPF handler and track it as a gpa_t to more precisely document what type of address is being captured, and because "gpa" is much more succinct. No functional change intended. Link: https://patch.msgid.link/20251113221642.1673023-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: nSVM: Remove a user-triggerable WARN on nested_svm_load_cr3() succeedingSean Christopherson
Drop the WARN in svm_set_nested_state() on nested_svm_load_cr3() failing as it is trivially easy to trigger from userspace by modifying CPUID after loading CR3. E.g. modifying the state restoration selftest like so: --- tools/testing/selftests/kvm/x86/state_test.c +++ tools/testing/selftests/kvm/x86/state_test.c @@ -280,7 +280,16 @@ int main(int argc, char *argv[]) /* Restore state in a new VM. */ vcpu = vm_recreate_with_one_vcpu(vm); - vcpu_load_state(vcpu, state); + + if (stage == 4) { + state->sregs.cr3 = BIT(44); + vcpu_load_state(vcpu, state); + + vcpu_set_cpuid_property(vcpu, X86_PROPERTY_MAX_PHY_ADDR, 36); + __vcpu_nested_state_set(vcpu, &state->nested); + } else { + vcpu_load_state(vcpu, state); + } /* * Restore XSAVE state in a dummy vCPU, first without doing generates: WARNING: CPU: 30 PID: 938 at arch/x86/kvm/svm/nested.c:1877 svm_set_nested_state+0x34a/0x360 [kvm_amd] Modules linked in: kvm_amd kvm irqbypass [last unloaded: kvm] CPU: 30 UID: 1000 PID: 938 Comm: state_test Tainted: G W 6.18.0-rc7-58e10b63777d-next-vm Tainted: [W]=WARN Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:svm_set_nested_state+0x34a/0x360 [kvm_amd] Call Trace: <TASK> kvm_arch_vcpu_ioctl+0xf33/0x1700 [kvm] kvm_vcpu_ioctl+0x4e6/0x8f0 [kvm] __x64_sys_ioctl+0x8f/0xd0 do_syscall_64+0x61/0xad0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Simply delete the WARN instead of trying to prevent userspace from shoving "illegal" state into CR3. For better or worse, KVM's ABI allows userspace to set CPUID after SREGS, and vice versa, and KVM is very permissive when it comes to guest CPUID. I.e. attempting to enforce the virtual CPU model when setting CPUID could break userspace. Given that the WARN doesn't provide any meaningful protection for KVM or benefit for userspace, simply drop it even though the odds of breaking userspace are minuscule. Opportunistically delete a spurious newline. Fixes: b222b0b88162 ("KVM: nSVM: refactor the CR3 reload on migration") Cc: stable@vger.kernel.org Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251216161755.1775409-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08arm64: dts: apple: s8001: Add DWI backlight for J98a, J99aNick Chan
iPad Pro (12.9-inch) uses DWI backlight, while the 9.7-inch model does not. Add DWI backlight node for s8001 and enable it for J98a and J99a. Signed-off-by: Nick Chan <towinchenmi@gmail.com> Link: https://patch.msgid.link/20251231-b4-j98a-j99a-dwi-bl-v1-1-24793c2b99fc@gmail.com Signed-off-by: Sven Peter <sven@kernel.org>
2026-01-08spi: Simplify devm_spi_*_controller()Andy Shevchenko
Use devm_add_action_or_reset() instead of devres_alloc() and devres_add(), which works the same. This will simplify the code. There is no functional changes. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20260108175145.3535441-1-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-01-08spi: microchip-core: use XOR instead of ANDNOT to fix the logicAndy Shevchenko
Use XOR instead of ANDNOT to fix the logic. The current approach with (foo & BAR & ~baz) is harder to process, and it proved to be wrong, than more usual pattern for the comparing misconfiguration using ((foo ^ baz) & BAR) which can be read as "find all different bits between foo and baz that are related to BAR (mask)". Besides that it makes the binary code shorter. Function old new delta mchp_corespi_setup 103 99 -4 Fixes: 059f545832be ("spi: add support for microchip "soft" spi controller") Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Prajna Rajendra Kumar <prajna.rajendrakumar@microchip.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20260108175100.3535306-1-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-01-08KVM: x86: Don't read guest CR3 when doing async pf while the MMU is directXiaoyao Li
Don't read guest CR3 in kvm_arch_setup_async_pf() if the MMU is direct and use INVALID_GPA instead. When KVM tries to perform the host-only async page fault for the shared memory of TDX guests, the following WARNING is triggered: WARNING: CPU: 1 PID: 90922 at arch/x86/kvm/vmx/main.c:483 vt_cache_reg+0x16/0x20 Call Trace: __kvm_mmu_faultin_pfn kvm_mmu_faultin_pfn kvm_tdp_page_fault kvm_mmu_do_page_fault kvm_mmu_page_fault tdx_handle_ept_violation This WARNING is triggered when calling kvm_mmu_get_guest_pgd() to cache the guest CR3 in kvm_arch_setup_async_pf() for later use in kvm_arch_async_page_ready() to determine if it's possible to fix the page fault in the current vCPU context to save one VM exit. However, when guest state is protected, KVM cannot read the guest CR3. Since protected guests aren't compatible with shadow paging, i.e, they must use direct MMU, avoid calling kvm_mmu_get_guest_pgd() to read guest CR3 when the MMU is direct and use INVALID_GPA instead. Note that for protected guests mmu->root_role.direct is always true, so that kvm_mmu_get_guest_pgd() in kvm_arch_async_page_ready() won't be reached. Reported-by: Farrah Chen <farrah.chen@intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20251212135051.2155280-1-xiaoyao.li@intel.com [sean: explicitly cast to "unsigned long" to make 32-bit builds happy] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Rename vm_get_page_table_entry() to vm_get_pte()Sean Christopherson
Shorten the API to get a PTE as the "PTE" acronym is ubiquitous, and the "page table entry" makes it unnecessarily difficult to quickly understand what callers are doing. No functional change intended. Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Extend memstress to run on nested SVMYosry Ahmed
Add L1 SVM code and generalize the setup code to work for both VMX and SVM. This allows running 'dirty_log_perf_test -n' on AMD CPUs. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Extend vmx_dirty_log_test to cover SVMYosry Ahmed
Generalize the code in vmx_dirty_log_test.c by adding SVM-specific L1 code, doing some renaming (e.g. EPT -> TDP), and having setup code for both SVM and VMX in test_dirty_log(). Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Set the user bit on nested NPT PTEsYosry Ahmed
According to the APM, NPT walks are treated as user accesses. In preparation for supporting NPT mappings, set the 'user' bit on NPTs by adding a mask of bits to always be set on PTEs in kvm_mmu. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Add support for nested NPTsYosry Ahmed
Implement nCR3 and NPT initialization functions, similar to the EPT equivalents, and create common TDP helpers for enablement checking and initialization. Enable NPT for nested guests by default if the TDP MMU was initialized, similar to VMX. Reuse the PTE masks from the main MMU in the NPT MMU, except for the C and S bits related to confidential VMs. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-17-seanjc@google.com [sean: apply Yosry's fixup for ncr3_gpa] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Allow kvm_cpu_has_ept() to be called on AMD CPUsYosry Ahmed
In preparation for generalizing the nested dirty logging test, checking if either EPT or NPT is enabled will be needed. To avoid needing to gate the kvm_cpu_has_ept() call by the CPU type, make sure the function returns false if VMX is not available instead of trying to read VMX-only MSRs. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Move TDP mapping functions outside of vmx.cSean Christopherson
Now that the functions are no longer VMX-specific, move them to processor.c. Do a minor comment tweak replacing 'EPT' with 'TDP'. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Reuse virt mapping functions for nested EPTsYosry Ahmed
Rework tdp_map() and friends to use __virt_pg_map() and drop the custom EPT code in __tdp_pg_map() and tdp_create_pte(). The EPT code and __virt_pg_map() are practically identical, the main differences are: - EPT uses the EPT struct overlay instead of the PTE masks. - EPT always assumes 4-level EPTs. To reuse __virt_pg_map(), extend the PTE masks to work with EPT's RWX and X-only capabilities, and provide a tdp_mmu_init() API so that EPT can pass in the EPT PTE masks along with the root page level (which is currently hardcoded to '4'). Don't reuse KVM's insane overloading of the USER bit for EPT_R as there's no reason to multiplex bits in the selftests, e.g. selftests aren't trying to shadow guest PTEs and thus don't care about funnelling protections into a common permissions check. Another benefit of reusing the code is having separate handling for upper-level PTEs vs 4K PTEs, which avoids some quirks like setting the large bit on a 4K PTE in the EPTs. For all intents and purposes, no functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251230230150.4150236-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Add a stage-2 MMU instance to kvm_vmSean Christopherson
Add a stage-2 MMU instance so that architectures that support nested virtualization (more specifically, nested stage-2 page tables) can create and track stage-2 page tables for running L2 guests. Plumb the structure into common code to avoid cyclical dependencies, and to provide some line of sight to having common APIs for creating stage-2 mappings. As a bonus, putting the member in common code justifies using stage2_mmu instead of tdp_mmu for x86. Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Stop passing VMX metadata to TDP mapping functionsYosry Ahmed
The root GPA is now retrieved from the nested MMU, stop passing VMX metadata. This is in preparation for making these functions work for NPTs as well. Opportunistically drop tdp_pg_map() since it's unused. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Use a TDP MMU to share EPT page tables between vCPUsYosry Ahmed
prepare_eptp() currently allocates new EPTs for each vCPU. memstress has its own hack to share the EPTs between vCPUs. Currently, there is no reason to have separate EPTs for each vCPU, and the complexity is significant. The only reason it doesn't matter now is because memstress is the only user with multiple vCPUs. Add vm_enable_ept() to allocate EPT page tables for an entire VM, and use it everywhere to replace prepare_eptp(). Drop 'eptp' and 'eptp_hva' from 'struct vmx_pages' as they serve no purpose (e.g. the EPTP can be built from the PGD), but keep 'eptp_gpa' so that the MMU structure doesn't need to be passed in along with vmx_pages. Dynamically allocate the TDP MMU structure to avoid a cyclical dependency between kvm_util_arch.h and kvm_util.h. Remove the workaround in memstress to copy the EPT root between vCPUs since that's now the default behavior. Name the MMU tdp_mmu instead of e.g. nested_mmu or nested.mmu to avoid recreating the same mess that KVM has with respect to "nested" MMUs, e.g. does nested refer to the stage-2 page tables created by L1, or the stage-1 page tables created by L2? Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251230230150.4150236-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Move PTE bitmasks to kvm_mmuYosry Ahmed
Move the PTE bitmasks into kvm_mmu to parameterize them for virt mapping functions. Introduce helpers to read/write different PTE bits given a kvm_mmu. Drop the 'global' bit definition as it's currently unused, but leave the 'user' bit as it will be used in coming changes. Opportunisitcally rename 'large' to 'huge' as it's more consistent with the kernel naming. Leave PHYSICAL_PAGE_MASK alone, it's fixed in all page table formats and a lot of other macros depend on it. It's tempting to move all the other macros to be per-struct instead, but it would be too much noise for little benefit. Keep c_bit and s_bit in vm->arch as they used before the MMU is initialized, through __vmcreate() -> vm_userspace_mem_region_add() -> vm_mem_add() -> vm_arch_has_protected_memory(). No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: rename accessors to is_<adjective>_pte()] Link: https://patch.msgid.link/20251230230150.4150236-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Add a "struct kvm_mmu_arch arch" member to kvm_mmuSean Christopherson
Add an arch structure+field in "struct kvm_mmu" so that architectures can track arch-specific information for a given MMU. No functional change intended. Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Plumb "struct kvm_mmu" into x86's MMU APIsSean Christopherson
In preparation for generalizing the x86 virt mapping APIs to work with TDP (stage-2) page tables, plumb "struct kvm_mmu" into all of the helper functions instead of operating on vm->mmu directly. Opportunistically swap the order of the check in virt_get_pte() to first assert that the parent is the PGD, and then check that the PTE is present, as it makes more sense to check if the parent PTE is the PGD/root (i.e. not a PTE) before checking that the PTE is PRESENT. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: rebase on common kvm_mmu structure, rewrite changelog] Link: https://patch.msgid.link/20251230230150.4150236-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Add "struct kvm_mmu" to track a given MMU instanceSean Christopherson
Add a "struct kvm_mmu" to track a given MMU instance, e.g. a VM's stage-1 MMU versus a VM's stage-2 MMU, so that x86 can share MMU functionality for both stage-1 and stage-2 MMUs, without creating the potential for subtle bugs, e.g. due to consuming on vm->pgtable_levels when operating a stage-2 MMU. Encapsulate the existing de facto MMU in "struct kvm_vm", e.g instead of burying the MMU details in "struct kvm_vm_arch", to avoid more #ifdefs in ____vm_create(), and in the hopes that other architectures can utilize the formalized MMU structure if/when they too support stage-2 page tables. No functional change intended. Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Stop setting A/D bits when creating EPT PTEsYosry Ahmed
Stop setting Accessed/Dirty bits when creating EPT entries for L2 so that the stage-1 and stage-2 (a.k.a. TDP) page table APIs can use common code without bleeding the EPT hack into the common APIs. While commit 094444204570 ("selftests: kvm: add test for dirty logging inside nested guests") is _very_ light on details, the most likely explanation is that vmx_dirty_log_test was attempting to avoid taking an EPT Violation on the first _write_ from L2. static void l2_guest_code(u64 *a, u64 *b) { READ_ONCE(*a); WRITE_ONCE(*a, 1); <=== GUEST_SYNC(true); ... } When handling read faults in the shadow MMU, KVM opportunistically creates a writable SPTE if the mapping can be writable *and* the gPTE is dirty (or doesn't support the Dirty bit), i.e. if KVM doesn't need to intercept writes in order to emulate Dirty-bit updates. By setting A/D bits in the test's EPT entries, the above READ+WRITE will fault only on the read, and in theory expose the bug fixed by KVM commit 1f4e5fc83a42 ("KVM: x86: fix nested guest live migration with PML"). If the Dirty bit is NOT set, the test will get a false pass due; though again, in theory. However, the test is flawed (and always was, at least in the versions posted publicly), as KVM (correctly) marks the corresponding L1 GFN as dirty (in the dirty bitmap) when creating the writable SPTE. I.e. without a check on the dirty bitmap after the READ_ONCE(), the check after the first WRITE_ONCE() will get a false pass due to the dirty bitmap/log having been updated by the read fault, not by PML. Furthermore, the subsequent behavior in the test's l2_guest_code() effectively hides the flawed test behavior, as the straight writes to a new L2 GPA fault also trigger the KVM bug, and so the test will still detect the failure due to lack of isolation between the two testcases (Read=>Write vs. Write=>Write). WRITE_ONCE(*b, 1); GUEST_SYNC(true); WRITE_ONCE(*b, 1); GUEST_SYNC(true); GUEST_SYNC(false); Punt on fixing vmx_dirty_log_test for the moment as it will be easier to properly fix the test once the TDP code uses the common MMU APIs, at which point it will be trivially easy for the test to retrieve the EPT PTE and set the Dirty bit as needed. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: rewrite changelog to explain the situation] Link: https://patch.msgid.link/20251230230150.4150236-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Kill eptPageTablePointerYosry Ahmed
Replace the struct overlay with explicit bitmasks, which is clearer and less error-prone. See commit f18b4aebe107 ("kvm: selftests: do not use bitfields larger than 32-bits for PTEs") for an example of why bitfields are not preferable. Remove the unused PAGE_SHIFT_4K definition while at it. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Rename nested TDP mapping functionsYosry Ahmed
Rename the functions from nested_* to tdp_* to make their purpose clearer. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Stop passing a memslot to nested_map_memslot()Yosry Ahmed
On x86, KVM selftests use memslot 0 for all the default regions used by the test infrastructure. This is an implementation detail. nested_map_memslot() is currently used to map the default regions by explicitly passing slot 0, which leaks the library implementation into the caller. Rename the function to a very verbose nested_identity_map_default_memslots() to reflect what it actually does. Add an assertion that only memslot 0 is being used so that the implementation does not change from under us. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Make __vm_get_page_table_entry() staticYosry Ahmed
The function is only used in processor.c, drop the declaration in processor.h and make it static. No functional change intended. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20251230230150.4150236-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: selftests: Fix sign extension bug in get_desc64_base()MJ Pooladkhay
The function get_desc64_base() performs a series of bitwise left shifts on fields of various sizes. More specifically, when performing '<< 24' on 'desc->base2' (which is a u8), 'base2' is promoted to a signed integer before shifting. In a scenario where base2 >= 0x80, the shift places a 1 into bit 31, causing the 32-bit intermediate value to become negative. When this result is cast to uint64_t or ORed into the return value, sign extension occurs, corrupting the upper 32 bits of the address (base3). Example: Given: base0 = 0x5000 base1 = 0xd6 base2 = 0xf8 base3 = 0xfffffe7c Expected return: 0xfffffe7cf8d65000 Actual return: 0xfffffffff8d65000 Fix this by explicitly casting the fields to 'uint64_t' before shifting to prevent sign extension. Signed-off-by: MJ Pooladkhay <mj@pooladkhay.com> Link: https://patch.msgid.link/20251222174207.107331-1-mj@pooladkhay.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: x86: Return "unsupported" instead of "invalid" on access to unsupported ↵Sean Christopherson
PV MSR Return KVM_MSR_RET_UNSUPPORTED instead of '1' (which for all intents and purposes means "invalid") when rejecting accesses to KVM PV MSRs to adhere to KVM's ABI of allowing host reads and writes of '0' to MSRs that are advertised to userspace via KVM_GET_MSR_INDEX_LIST, even if the vCPU model doesn't support the MSR. E.g. running a QEMU VM with -cpu host,-kvmclock,kvm-pv-enforce-cpuid yields: qemu: error: failed to set MSR 0x12 to 0x0 qemu: target/i386/kvm/kvm.c:3301: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID") Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://patch.msgid.link/20251230205948.4094097-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: nVMX: Mark APIC access page dirty when syncing vmcs12 pagesFred Griffoul
For consistency with commit 7afe79f5734a ("KVM: nVMX: Mark vmcs12's APIC access page dirty when unmapping"), which marks the page dirty during unmap operations, also mark it dirty during vmcs12 page synchronization. Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk> [sean: use kvm_vcpu_map_mark_dirty()] Link: https://patch.msgid.link/20251121223444.355422-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Move nested_mark_vmcs12_pages_dirty() to vmx.c, and renameSean Christopherson
Move nested_mark_vmcs12_pages_dirty() to vmx.c now that it's only used in the VM-Exit path, and add "all" to its name to document that its purpose is to mark all (mapped-out-of-band) vmcs12 pages as dirty. No functional change intended. Link: https://patch.msgid.link/20251121223444.355422-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: nVMX: Precisely mark vAPIC and PID maps dirty when delivering nested PISean Christopherson
Explicitly mark the vmcs12 vAPIC and PI descriptor pages as dirty when delivering a nested posted interrupt instead of marking all vmcs12 pages as dirty. This will allow marking the APIC access page (and any future vmcs12 pages) as dirty in nested_mark_vmcs12_pages_dirty() without over- dirtying in the nested PI case. Manually marking the vAPIC and PID pages as dirty also makes the flow a bit more self-documenting, e.g. it's not obvious at first glance that vmx->nested.pi_desc is actually a host kernel mapping of a vmcs12 page. No functional change intended. Link: https://patch.msgid.link/20251121223444.355422-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: x86: Mark vmcs12 pages as dirty if and only if they're mappedSean Christopherson
Mark vmcs12 pages as dirty (in KVM's dirty log bitmap) if and only if the page is mapped, i.e. if the page is actually "active" in vmcs02. For some pages, KVM simply disables the associated VMCS control if the vmcs12 page is unreachable, i.e. it's possible for nested VM-Enter to succeed with a "bad" vmcs12 page. Link: https://patch.msgid.link/20251121223444.355422-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: Use vCPU specific memslots in __kvm_vcpu_map()Sean Christopherson
When establishing a "host access map", lookup the gfn in the vCPU specific memslots, as the intent is that the mapping will be established for the current vCPU context. Specifically, using __kvm_vcpu_map() in x86's SMM context should create mappings based on the SMM memslots, not the non-SMM memslots. Luckily, the bug is benign as x86 is the only architecture with multiple memslot address spaces, and all of x86's usage is limited to non-SMM. The calls in (or reachable by) {svm,vmx}_enter_smm() are made before enter_smm() sets HF_SMM_MASK, and the calls in {svm,vmx}_leave_smm() are made after emulator_leave_smm() clears HF_SMM_MASK. Note, kvm_vcpu_unmap() uses the vCPU specific memslots, only the map() side of things is broken. Fixes: 357a18ad230f ("KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache") Link: https://patch.msgid.link/20251121223444.355422-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08MAINTAINERS: add docs and selftest to the TLS file listJakub Kicinski
The TLS MAINTAINERS entry does not seem to cover the selftest or docs. Add those. While at it remove the unnecessary wildcard from net/tls/, there are no subdirectories anyway so this change has no impact today. Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260106200706.1596250-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-08KVM: VMX: Add mediated PMU support for CPUs without "save perf global ctrl"Sean Christopherson
Extend mediated PMU support for Intel CPUs without support for saving PERF_GLOBAL_CONTROL into the guest VMCS field on VM-Exit, e.g. for Skylake and its derivatives, as well as Icelake. While supporting CPUs without VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL isn't completely trivial, it's not that complex either. And not supporting such CPUs would mean not supporting 7+ years of Intel CPUs released in the past 10 years. On VM-Exit, immediately propagate the saved PERF_GLOBAL_CTRL to the VMCS as well as KVM's software cache so that KVM doesn't need to add full EXREG tracking of PERF_GLOBAL_CTRL. In practice, the vast majority of VM-Exits won't trigger software writes to guest PERF_GLOBAL_CTRL, so deferring the VMWRITE to the next VM-Enter would only delay the inevitable without batching/avoiding VMWRITEs. Note! Take care to refresh VM_EXIT_MSR_STORE_COUNT on nested VM-Exit, as it's unfortunately possible that KVM could recalculate MSR intercepts while L2 is active, e.g. if userspace loads nested state and _then_ sets PERF_CAPABILITIES. Eating the VMWRITE on every nested VM-Exit is unfortunate, but that's a pre-existing problem and can/should be solved separately, e.g. modifying the number of auto-load entries while L2 is active is also uncommon on modern CPUs. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-45-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Initialize vmcs01.VM_EXIT_MSR_STORE_ADDR with list addressSean Christopherson
Initialize vmcs01.VM_EXIT_MSR_STORE_ADDR to point at the vCPU's msr_autostore list in anticipation of utilizing the auto-store functionality, and to harden KVM against stray reads to pfn 0 (or, in theory, a random pfn if the underlying CPU uses a complex scheme for encoding VMCS data). The MSR auto lists are supposed to be ignored if the associated COUNT VMCS field is '0', but leaving the ADDR field zero-initialized in memory is an unnecessary risk (albeit a minuscule risk) given that the cost is a single VMWRITE during vCPU creation. Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Dedup code for adding MSR to VMCS's auto listSean Christopherson
Add a helper to add an MSR to a VMCS's "auto" list to deduplicate the code in add_atomic_switch_msr(), and so that the functionality can be used in the future for managing the MSR auto-store list. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-43-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Compartmentalize adding MSRs to host vs. guest auto-load listSean Christopherson
Undo the bundling of the "host" and "guest" MSR auto-load list logic so that the code can be deduplicated by factoring out the logic to a separate helper. Now that "list full" situations are treated as fatal to the VM, there is no need to pre-check both lists. For all intents and purposes, this reverts the add_atomic_switch_msr() changes made by commit 3190709335dd ("x86/KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting"). Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Set MSR index auto-load entry if and only if entry is "new"Sean Christopherson
When adding an MSR to the auto-load lists, update the MSR index in the list entry if and only if a new entry is being inserted, as 'i' can only be non-negative if vmx_find_loadstore_msr_slot() found an entry with the MSR's index. Unnecessarily setting the index is benign, but it makes it harder to see that updating the value is necessary even when an existing entry for the MSR was found. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-41-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Bug the VM if either MSR auto-load list is fullSean Christopherson
WARN and bug the VM if either MSR auto-load list is full when adding an MSR to the lists, as the set of MSRs that KVM loads via the lists is finite and entirely KVM controlled, i.e. overflowing the lists shouldn't be possible in a fully released version of KVM. Terminate the VM as the core KVM infrastructure has no insight as to _why_ an MSR is being added to the list, and failure to load an MSR on VM-Enter and/or VM-Exit could be fatal to the host. E.g. running the host with a guest-controlled PEBS MSR could generate unexpected writes to the DS buffer and crash the host. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Drop unused @entry_only param from add_atomic_switch_msr()Sean Christopherson
Drop the "on VM-Enter only" parameter from add_atomic_switch_msr() as it is no longer used, and for all intents and purposes was never used. The functionality was added, under embargo, by commit 989e3992d2ec ("x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs"), and then ripped out by commit 2f055947ae5e ("x86/kvm: Drop L1TF MSR list approach") just a few commits later. 2f055947ae5e x86/kvm: Drop L1TF MSR list approach 72c6d2db64fa x86/litf: Introduce vmx status variable 215af5499d9e cpu/hotplug: Online siblings when SMT control is turned on 390d975e0c4e x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required 989e3992d2ec x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs Furthermore, it's extremely unlikely KVM will ever _need_ to load an MSR value via the auto-load lists only on VM-Enter. MSRs writes via the lists aren't optimized in any way, and so the only reason to use the lists instead of a WRMSR are for cases where the MSR _must_ be load atomically with respect to VM-Enter (and/or VM-Exit). While one could argue that command MSRs, e.g. IA32_FLUSH_CMD, "need" to be done exact at VM-Enter, in practice doing such flushes within a few instructons of VM-Enter is more than sufficient. Note, the shortlog and changelog for commit 390d975e0c4e ("x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required") are misleading and wrong. That commit added MSR_IA32_FLUSH_CMD to the VM-Enter _load_ list, not the VM-Enter save list (which doesn't exist, only VM-Exit has a store/save list). Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Dedup code for removing MSR from VMCS's auto-load listSean Christopherson
Add a helper to remove an MSR from an auto-{load,store} list to dedup the msr_autoload code, and in anticipation of adding similar functionality for msr_autostore. No functional change intended. Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: nVMX: Don't update msr_autostore count when saving TSC for vmcs12Sean Christopherson
Rework nVMX's use of the MSR auto-store list to snapshot TSC to sneak MSR_IA32_TSC into the list _without_ updating KVM's software tracking, and drop the generic functionality so that future usage of the store list for nested specific logic needs to consider the implications of modifying the list. Updating the list only for vmcs02 and only on nested VM-Enter is a disaster waiting to happen, as it means vmcs01 is stale relative to the software tracking, and KVM could unintentionally leave an MSR in the store list in perpetuity while running L1, e.g. if KVM addressed the first issue and updated vmcs01 on nested VM-Exit without removing TSC from the list. Furthermore, mixing KVM's desire to save an MSR with L1's desire to save an MSR result KVM clobbering/ignoring the needs of vmcs01 or vmcs02. E.g. if KVM added MSR_IA32_TSC to the store list for its own purposes, and then _removed_ MSR_IA32_TSC from the list after emulating nested VM-Enter, then KVM would remove MSR_IA32_TSC from the list even though saving TSC on VM-Exit from L2 is still desirable (to provide L1 with an accurate TSC). Similarly, removing an MSR from the list based on vmcs12's settings could drop an MSR that KVM wants to save for its own purposes. In practice, the issues are currently benign, because KVM doesn't use the store list for vmcs01. But that will change with upcoming mediated PMU support. Alternatively, a "full" solution would be to track MSR list entries for vmcs12 separately from KVM's standard lists, but MSR_IA32_TSC is likely the only MSR that KVM would ever want to save on _every_ VM-Exit purely based on vmcs12. I.e. the added complexity isn't remotely justified at this time. Opportunistically escalate from a pr_warn_ratelimited() to a full WARN as KVM reserves eight entries in each MSR list, and as above KVM uses at most one entry. Opportunistically make vmx_find_loadstore_msr_slot() local to vmx.c as using it directly from nested code is unsafe due to the potential for mixing vmcs01 and vmcs02 state (see above). Cc: Jim Mattson <jmattson@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-37-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: VMX: Drop intermediate "guest" field from msr_autostoreSean Christopherson
Drop the intermediate "guest" field from vcpu_vmx.msr_autostore as the value saved on VM-Exit isn't guaranteed to be the guest's value, it's purely whatever is in hardware at the time of VM-Exit. E.g. KVM's only use of the store list at the momemnt is to snapshot TSC at VM-Exit, and the value saved is always the raw TSC even if TSC-offseting and/or TSC-scaling is enabled for the guest. And unlike msr_autoload, there is no need differentiate between "on-entry" and "on-exit". No functional change intended. Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: x86/pmu: Elide WRMSRs when loading guest PMCs if values already matchSean Christopherson
When loading a mediated PMU state, elide the WRMSRs to load PMCs with the guest's value if the value in hardware already matches the guest's value. For the relatively common case where neither the guest nor the host is actively using the PMU, i.e. when all/many counters are '0', eliding the WRMSRs reduces the latency of handling VM-Exit by a measurable amount (WRMSR is significantly more expensive than RDPMC). As measured by KVM-Unit-Tests' CPUID VM-Exit testcase, this provides a a ~25% reduction in latency (4k => 3k cycles) on Intel Emerald Rapids, and a ~13% reduction (6.2k => 5.3k cycles) on AMD Turin. Cc: Manali Shukla <manali.shukla@amd.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-35-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: x86/pmu: Expose enable_mediated_pmu parameter to user spaceDapeng Mi
Expose enable_mediated_pmu parameter to user space, i.e. allow userspace to enable/disable mediated vPMU support. Document the mediated versus perf-based behavior as part of the kernel-parameters.txt entry, and opportunistically add an entry for the core enable_pmu param as well. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-34-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08KVM: nSVM: Disable PMU MSR interception as appropriate while running L2Sean Christopherson
Add MSRs that might be passed through to L1 when running with a mediated PMU to the nested SVM's set of to-be-merged MSR indices, i.e. disable interception of PMU MSRs when running L2 if both KVM (L0) and L1 disable interception. There is no need for KVM to interpose on such MSR accesses, e.g. if L1 exposes a mediated PMU (or equivalent) to L2. Tested-by: Xudong Hao <xudong.hao@intel.com> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/20251206001720.468579-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>