linux.git/arch/x86, branch v6.11

Merge tag 'for-linus-6.11' of git://git.kernel.org/pub/scm/virt/kvm/kvm

2024-09-15T07:35:50+00:00

Pull kvm fix from Paolo Bonzini:
 "Do not always honor guest PAT on CPUs that support self-snoop.

  This triggers an issue in the bochsdrm driver, which used ioremap()
  instead of ioremap_wc() to map the video RAM.

  The revert lets video RAM use the WB memory type instead of the slower
  UC memory type"

* tag 'for-linus-6.11' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"

Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"

2024-09-15T06:49:33+00:00

This reverts commit 377b2f359d1f71c75f8cc352b5c81f2210312d83.

This caused a regression with the bochsdrm driver, which used ioremap()
instead of ioremap_wc() to map the video RAM.  After the commit, the
WB memory type is used without the IGNORE_PAT, resulting in the slower
UC memory type.  In fact, UC is slow enough to basically cause guests
to not boot... but only on new processors such as Sapphire Rapids and
Cascade Lake.  Coffee Lake for example works properly, though that might
also be an effect of being on a larger, more NUMA system.

The driver has been fixed but that does not help older guests.  Until we
figure out whether Cascade Lake and newer processors are working as
intended, revert the commit.  Long term we might add a quirk, but the
details depend on whether the processors are working as intended: for
example if they are, the quirk might reference bochs-compatible devices,
e.g. in the name and documentation, so that userspace can disable the
quirk by default and only leave it enabled if such a device is being
exposed to the guest.

If instead this is actually a bug in CLX+, then the actions we need to
take are different and depend on the actual cause of the bug.

Signed-off-by: Paolo Bonzini

Merge tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

2024-09-09T16:31:55+00:00

Pull hyperv fixes from Wei Liu:

 - Add a documentation overview of Confidential Computing VM support
   (Michael Kelley)

 - Use lapic timer in a TDX VM without paravisor (Dexuan Cui)

 - Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
   (Michael Kelley)

 - Fix a kexec crash due to VP assist page corruption (Anirudh
   Rayabharam)

 - Python3 compatibility fix for lsvmbus (Anthony Nandaa)

 - Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li)

* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hv: vmbus: Constify struct kobj_type and struct attribute_group
  tools: hv: rm .*.cmd when make clean
  x86/hyperv: fix kexec crash due to VP assist page corruption
  Drivers: hv: vmbus: Fix the misplaced function description
  tools: hv: lsvmbus: change shebang to use python3
  x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
  Documentation: hyperv: Add overview of Confidential Computing VM support
  clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
  Drivers: hv: Remove deprecated hv_fcopy declarations

KVM: x86: don't fall through case statements without annotations

2024-09-06T22:23:33+00:00

clang warns on this because it has an unannotated fall-through between
cases:

   arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]

and while we could annotate it as a fallthrough, the proper fix is to
just add the break for this case, instead of falling through to the
default case and the break there.

gcc also has that warning, but it looks like gcc only warns for the
cases where they fall through to "real code", rather than to just a
break.  Odd.

Fixes: d30d9ee94cc0 ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM")
Cc: Paolo Bonzini 
Cc: Tom Dohrmann 
Signed-off-by: Linus Torvalds

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

2024-09-06T19:45:43+00:00

Pull x86 kvm fixes from Paolo Bonzini:
 "Many small fixes that accumulated while I was on vacation...

   - Fixup missed comments from the REMOVED_SPTE => FROZEN_SPTE rename

   - Ensure a root is successfully loaded when pre-faulting SPTEs

   - Grab kvm->srcu when handling KVM_SET_VCPU_EVENTS to guard against
     accessing memslots if toggling SMM happens to force a VM-Exit

   - Emulate MSR_{FS,GS}_BASE on SVM even though interception is always
     disabled, so that KVM does the right thing if KVM's emulator
     encounters {RD,WR}MSR

   - Explicitly clear BUS_LOCK_DETECT from KVM's caps on AMD, as KVM
     doesn't yet virtualize BUS_LOCK_DETECT on AMD

   - Cleanup the help message for CONFIG_KVM_AMD_SEV, and call out that
     KVM now supports SEV-SNP too

   - Specialize return value of
     KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM), based on VM type

   - Remove unnecessary dependency on CONFIG_HIGH_RES_TIMERS

   - Note an RCU quiescent state on guest exit. This avoids a call to
     rcu_core() if there was a grace period request while guest was
     running"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: Remove HIGH_RES_TIMERS dependency
  kvm: Note an RCU quiescent state on guest exit
  KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM
  KVM: SEV: Update KVM_AMD_SEV Kconfig entry and mention SEV-SNP
  KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missing
  KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASE
  KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS
  KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEs
  KVM: x86/mmu: Fixup comments missed by the REMOVED_SPTE=>FROZEN_SPTE rename

KVM: Remove HIGH_RES_TIMERS dependency

2024-09-05T16:04:54+00:00

Commit 92b5265d38f6a ("KVM: Depend on HIGH_RES_TIMERS") added a dependency
to high resolution timers with the comment:

    KVM lapic timer and tsc deadline timer based on hrtimer,
    setting a leftmost node to rb tree and then do hrtimer reprogram.
    If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram
    do nothing and then make kvm lapic timer and tsc deadline timer fail.

That was back in 2012, where hrtimer_start_range_ns() would do the
reprogramming with hrtimer_enqueue_reprogram(). But as that was a nop with
high resolution timers disabled, this did not work. But a lot has changed
in the last 12 years.

For example, commit 49a2a07514a3a ("hrtimer: Kick lowres dynticks targets on
timer enqueue") modifies __hrtimer_start_range_ns() to work with low res
timers. There's been lots of other changes that make low res work.

ChromeOS has tested this before as well, and it hasn't seen any issues
with running KVM with high res timers disabled.  There could be problems,
especially at low HZ, for guests that do not support kvmclock and rely
on precise delivery of periodic timers to keep their clock running.
This can be the APIC timer (provided by the kernel), the RTC (provided
by userspace), or the i8254 (choice of kernel/userspace).  These guests
are few and far between these days, and in the case of the APIC timer +
Intel hosts we can use the preemption timer (which is TSC-based and has
better latency _and_ accuracy).

In KVM, only x86 is requiring CONFIG_HIGH_RES_TIMERS; perhaps a "depends
on HIGH_RES_TIMERS || EXPERT" could be added to virt/kvm, or a pr_warn
could be added to kvm_init if HIGH_RES_TIMERS are not enabled.  But in
general, it seems that there must be other code in the kernel (maybe
sound/?) that is relying on having high-enough HZ or hrtimers but that's
not documented anywhere.  Whenever you disable it you probably need to
know what you're doing and what your workload is; so the dependency is
not particularly interesting, and we can just remove it.

Signed-off-by: Steven Rostedt (Google) 
Message-ID: <20240821095127.45d17b19@gandalf.local.home>
[Added the last two paragraphs to the commit message. - Paolo]
Signed-off-by: Paolo Bonzini

x86/hyperv: fix kexec crash due to VP assist page corruption

2024-09-05T07:21:37+00:00

commit 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when
CPUs go online/offline") introduces a new cpuhp state for hyperv
initialization.

cpuhp_setup_state() returns the state number if state is
CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states.
For the hyperv case, since a new cpuhp state was introduced it would
return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call
is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and
so hv_cpu_die() won't be called on all CPUs. This means the VP assist page
won't be reset. When the kexec kernel tries to setup the VP assist page
again, the hypervisor corrupts the memory region of the old VP assist page
causing a panic in case the kexec kernel is using that memory elsewhere.
This was originally fixed in commit dfe94d4086e4 ("x86/hyperv: Fix kexec
panic/hang issues").

Get rid of hyperv_init_cpuhp entirely since we are no longer using a
dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with
cpuhp_remove_state().

Cc: stable@vger.kernel.org
Fixes: 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline")
Signed-off-by: Anirudh Rayabharam (Microsoft) 
Reviewed-by: Vitaly Kuznetsov 
Reviewed-by: Michael Kelley 
Link: https://lore.kernel.org/r/20240828112158.3538342-1-anirudh@anirudhrb.com
Signed-off-by: Wei Liu 
Message-ID: <20240828112158.3538342-1-anirudh@anirudhrb.com>

KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM

2024-09-02T14:56:10+00:00

Until recently, KVM_CAP_READONLY_MEM was unconditionally supported on
x86, but this is no longer the case for SEV-ES and SEV-SNP VMs.

When KVM_CHECK_EXTENSION is invoked on a VM, only advertise
KVM_CAP_READONLY_MEM when it's actually supported.

Fixes: 66155de93bcf ("KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)")
Cc: Sean Christopherson 
Cc: Paolo Bonzini 
Cc: Michael Roth 
Signed-off-by: Tom Dohrmann 
Message-ID: <20240902144219.3716974-1-erbse.13@gmx.de>
Signed-off-by: Paolo Bonzini

Merge tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-09-01T21:43:08+00:00

Pull x86 fixes from Thomas Gleixner:

 - x2apic_disable() clears x2apic_state and x2apic_mode unconditionally,
   even when the state is X2APIC_ON_LOCKED, which prevents the kernel to
   disable it thereby creating inconsistent state.

   Reorder the logic so it actually works correctly

 - The XSTATE logic for handling LBR is incorrect as it assumes that
   XSAVES supports LBR when the CPU supports LBR. In fact both
   conditions need to be true. Otherwise the enablement of LBR in the
   IA32_XSS MSR fails and subsequently the machine crashes on the next
   XRSTORS operation because IA32_XSS is not initialized.

   Cache the XSTATE support bit during init and make the related
   functions use this cached information and the LBR CPU feature bit to
   cure this.

 - Cure a long standing bug in KASLR

   KASLR uses the full address space between PAGE_OFFSET and vaddr_end
   to randomize the starting points of the direct map, vmalloc and
   vmemmap regions. It thereby limits the size of the direct map by
   using the installed memory size plus an extra configurable margin for
   hot-plug memory. This limitation is done to gain more randomization
   space because otherwise only the holes between the direct map,
   vmalloc, vmemmap and vaddr_end would be usable for randomizing.

   The limited direct map size is not exposed to the rest of the kernel,
   so the memory hot-plug and resource management related code paths
   still operate under the assumption that the available address space
   can be determined with MAX_PHYSMEM_BITS.

   request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1
   downwards. That means the first allocation happens past the end of
   the direct map and if unlucky this address is in the vmalloc space,
   which causes high_memory to become greater than VMALLOC_START and
   consequently causes iounmap() to fail for valid ioremap addresses.

   Cure this by exposing the end of the direct map via PHYSMEM_END and
   use that for the memory hot-plug and resource management related
   places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case
   PHYSMEM_END maps to a variable which is initialized by the KASLR
   initialization and otherwise it is based on MAX_PHYSMEM_BITS as
   before.

 - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of
   an initialized variabled on the stack to the VMM. The variable is
   only required as output value, so it does not have to exposed to the
   VMM in the first place.

 - Prevent an array overrun in the resource control code on systems with
   Sub-NUMA Clustering enabled because the code failed to adjust the
   index by the number of SNC nodes per L3 cache.

* tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Fix arch_mbm_* array overrun on SNC
  x86/tdx: Fix data leak in mmio_read()
  x86/kaslr: Expose and use the end of the physical memory address space
  x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
  x86/apic: Make x2apic_disable() work correctly

KVM: SEV: Update KVM_AMD_SEV Kconfig entry and mention SEV-SNP

2024-08-28T12:46:25+00:00

SEV-SNP support is present since commit 1dfe571c12cf ("KVM: SEV: Add
initial SEV-SNP support") but Kconfig entry wasn't updated and still
mentions SEV and SEV-ES only. Add SEV-SNP there and, while on it, expand
'SEV' in the description as 'Encrypted VMs' is not what 'SEV' stands for.

No functional change.

Signed-off-by: Vitaly Kuznetsov 
Link: https://lore.kernel.org/r/20240828122111.160273-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson