summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)Author
2026-03-04arm64: dts: qcom: sm6125-xiaomi-ginkgo: Set memory-region for framebufferBarnabás Czémán
Use memory-region property for framebuffer instead of reg. Signed-off-by: Barnabás Czémán <barnabas.czeman@mainlining.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://lore.kernel.org/r/20260126-xiaomi-willow-v3-3-aad7b106c311@mainlining.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2026-03-04arm64: dts: qcom: sm6125-xiaomi-ginkgo: Correct reserved memory rangesBarnabás Czémán
The device was crashing on high memory load because the reserved memory ranges was wrongly defined. Correct the ranges for avoid the crashes. Change the ramoops memory range to match with the values from the recovery to be able to get the results from the device. Fixes: 9b1a6c925c88 ("arm64: dts: qcom: sm6125: Initial support for xiaomi-ginkgo") Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Barnabás Czémán <barnabas.czeman@mainlining.org> Link: https://lore.kernel.org/r/20260126-xiaomi-willow-v3-2-aad7b106c311@mainlining.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2026-03-04arm64: dts: qcom: sm6125-xiaomi-ginkgo: Remove board-idBarnabás Czémán
Remove board-id it is not necessary for the bootloader. Fixes: 9b1a6c925c88 ("arm64: dts: qcom: sm6125: Initial support for xiaomi-ginkgo") Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Signed-off-by: Barnabás Czémán <barnabas.czeman@mainlining.org> Link: https://lore.kernel.org/r/20260126-xiaomi-willow-v3-1-aad7b106c311@mainlining.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2026-03-04ARM: dts: qcom: Drop unused .dtsiRob Herring (Arm)
These .dtsi files are not included anywhere in the tree and can't be tested. Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Link: https://lore.kernel.org/r/20251212203226.458694-2-robh@kernel.org Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2026-03-04KVM: nSVM: Always use NextRIP as vmcb02's NextRIP after first L2 VMRUNYosry Ahmed
For guests with NRIPS disabled, L1 does not provide NextRIP when running an L2 with an injected soft interrupt, instead it advances the current RIP before running it. KVM uses the current RIP as the NextRIP in vmcb02 to emulate a CPU without NRIPS. However, after L2 runs the first time, NextRIP will be updated by the CPU and/or KVM, and the current RIP is no longer the correct value to use in vmcb02. Hence, after save/restore, use the current RIP if and only if a nested run is pending, otherwise use NextRIP. Give soft_int_next_rip the same treatment, as it's the same logic, just for a narrower use case. Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260225005950.3739782-6-yosry@kernel.org [sean: give soft_int_next_rip the same treatment] Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/mm/pat: Convert split_large_page() to use ptdescsVishal Moola (Oracle)
Use the ptdesc APIs for all page table allocation and free sites to allow their separate allocation from struct page in the future. Update split_large_page() to allocate a ptdesc instead of allocating a page for use as a page table. Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260303194828.1406905-5-vishal.moola@gmail.com
2026-03-04x86/mm/pat: Convert populate_pgd() to use page table apisVishal Moola (Oracle)
Use the ptdesc APIs for all page table allocation and free sites to allow their separate allocation from struct page in the future. Convert the remaining get_zeroed_page() calls to the generic page table APIs, as they already use ptdescs. Pass through init_mm since these are kernel page tables, as both functions require it to identify kernel page tables. Because the generic implementations do not use the second argument, pass a placeholder to avoid reimplementing them or risking breakage on other architectures. It is not obvious whether these pages are freed. Regardless, convert the remaining free paths as needed, noting that the only other possible free paths have already been converted and that a frozen page table test kernel has not reported any issues. Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260303194828.1406905-4-vishal.moola@gmail.com
2026-03-04x86/mm/pat: Convert pmd code to use page table apisVishal Moola (Oracle)
Use the ptdesc APIs for all page table allocation and free sites to allow their separate allocation from struct page in the future. Convert the PMD allocation and free sites to use the generic page table APIs, as they already use ptdescs. Pass through init_mm since these are kernel page tables, as pmd_alloc_one() requires it to identify kernel page tables. Because the generic implementation does not use the second argument, pass a placeholder to avoid reimplementing it or risking breakage on other architectures. Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260303194828.1406905-3-vishal.moola@gmail.com
2026-03-04x86/mm/pat: Convert pte code to use page table apisVishal Moola (Oracle)
Use the ptdesc APIs for all page table allocation and free sites to allow their separate allocation from struct page in the future. Convert the PTE allocation and free sites to use the generic page table APIs, as they already use ptdescs. Pass through init_mm since these are kernel page tables; otherwise, pte_alloc_one_kernel() becomes a no-op. Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260303194828.1406905-2-vishal.moola@gmail.com
2026-03-04KVM: TDX: Fold tdx_bringup() into tdx_hardware_setup()Sean Christopherson
Now that TDX doesn't need to manually enable virtualization through _KVM_ APIs during setup, fold tdx_bringup() into tdx_hardware_setup() where the code belongs, e.g. so that KVM doesn't leave the S-EPT kvm_x86_ops wired up when TDX is disabled. The weird ordering (and naming) was necessary to allow KVM TDX to use kvm_enable_virtualization(), which in turn had a hard dependency on kvm_x86_ops.enable_virtualization_cpu and thus kvm_x86_vendor_init(). Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt/tdx: Use ida_is_empty() to detect if any TDs may be runningSean Christopherson
Drop nr_configured_hkid and instead use ida_is_empty() to detect if any HKIDs have been allocated/configured. Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt/tdx: KVM: Consolidate TDX CPU hotplug handlingChao Gao
The core kernel registers a CPU hotplug callback to do VMX and TDX init and deinit while KVM registers a separate CPU offline callback to block offlining the last online CPU in a socket. Splitting TDX-related CPU hotplug handling across two components is odd and adds unnecessary complexity. Consolidate TDX-related CPU hotplug handling by integrating KVM's tdx_offline_cpu() to the one in the core kernel. Also move nr_configured_hkid to the core kernel because tdx_offline_cpu() references it. Since HKID allocation and free are handled in the core kernel, it's more natural to track used HKIDs there. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt/tdx: Tag a pile of functions as __init, and globals as __ro_after_initSean Christopherson
Now that TDX-Module initialization is done during subsys init, tag all related functions as __init, and relevant data as __ro_after_init. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: x86/tdx: Do VMXON and TDX-Module initialization during subsys initSean Christopherson
Now that VMXON can be done without bouncing through KVM, do TDX-Module initialization during subsys init (specifically before module_init() so that it runs before KVM when both are built-in). Aside from the obvious benefits of separating core TDX code from KVM, this will allow tagging a pile of TDX functions and globals as being __init and __ro_after_init. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt/tdx: Drop the outdated requirement that TDX be enabled in IRQ contextSean Christopherson
Remove TDX's outdated requirement that per-CPU enabling be done via IPI function call, which was a stale artifact leftover from early versions of the TDX enablement series. The requirement that IRQs be disabled should have been dropped as part of the revamped series that relied on a the KVM rework to enable VMX at module load. In other words, the kernel's "requirement" was never a requirement at all, but instead a reflection of how KVM enabled VMX (via IPI callback) when the TDX subsystem code was merged. Note, accessing per-CPU information is safe even without disabling IRQs, as tdx_online_cpu() is invoked via a cpuhp callback, i.e. from a per-CPU thread. Link: https://lore.kernel.org/all/ZyJOiPQnBz31qLZ7@google.com Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt: Add refcounting of VMX/SVM usage to support multiple in-kernel usersSean Christopherson
Implement a per-CPU refcounting scheme so that "users" of hardware virtualization, e.g. KVM and the future TDX code, can co-exist without pulling the rug out from under each other. E.g. if KVM were to disable VMX on module unload or when the last KVM VM was destroyed, SEAMCALLs from the TDX subsystem would #UD and panic the kernel. Disable preemption in the get/put APIs to ensure virtualization is fully enabled/disabled before returning to the caller. E.g. if the task were preempted after a 0=>1 transition, the new task would see a 1=>2 and thus return without enabling virtualization. Explicitly disable preemption instead of requiring the caller to do so, because the need to disable preemption is an artifact of the implementation. E.g. from KVM's perspective there is no _need_ to disable preemption as KVM guarantees the pCPU on which it is running is stable (but preemption is enabled). Opportunistically abstract away SVM vs. VMX in the public APIs by using X86_FEATURE_{SVM,VMX} to communicate what technology the caller wants to enable and use. Cc: Xu Yilun <yilun.xu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystemSean Christopherson
Move the majority of the code related to disabling hardware virtualization in emergency from KVM into the virt subsystem so that virt can take full ownership of the state of SVM/VMX. This will allow refcounting usage of SVM/VMX so that KVM and the TDX subsystem can enable VMX without stomping on each other. To route the emergency callback to the "right" vendor code, add to avoid mixing vendor and generic code, implement a x86_virt_ops structure to track the emergency callback, along with the SVM vs. VMX (vs. "none") feature that is active. To avoid having to choose between SVM and VMX, simply refuse to enable either if both are somehow supported. No known CPU supports both SVM and VMX, and it's comically unlikely such a CPU will ever exist. Leave KVM's clearing of loaded VMCSes and MSR_VM_HSAVE_PA in KVM, via a callback explicitly scoped to KVM. Loading VMCSes and saving/restoring host state are firmly tied to running VMs, and thus are (a) KVM's responsibility and (b) operations that are still exclusively reserved for KVM (as far as in-tree code is concerned). I.e. the contract being established is that non-KVM subsystems can utilize virtualization, but for all intents and purposes cannot act as full-blown hypervisors. Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: SVM: Move core EFER.SVME enablement to kernelSean Christopherson
Move the innermost EFER.SVME logic out of KVM and into to core x86 to land the SVM support alongside VMX support. This will allow providing a more unified API from the kernel to KVM, and will allow moving the bulk of the emergency disabling insanity out of KVM without having a weird split between kernel and KVM for SVM vs. VMX. No functional change intended. Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: VMX: Move core VMXON enablement to kernelSean Christopherson
Move the innermost VMXON+VMXOFF logic out of KVM and into to core x86 so that TDX can (eventually) force VMXON without having to rely on KVM being loaded, e.g. to do SEAMCALLs during initialization. Opportunistically update the comment regarding emergency disabling via NMI to clarify that virt_rebooting will be set by _another_ emergency callback, i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only before _this_ invocation does VMXOFF. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt: Force-clear X86_FEATURE_VMX if configuring root VMCS failsSean Christopherson
If allocating and configuring a root VMCS fails, clear X86_FEATURE_VMX in all CPUs so that KVM doesn't need to manually check root_vmcs. As added bonuses, clearing VMX will reflect that VMX is unusable in /proc/cpuinfo, and will avoid a futile auto-probe of kvm-intel.ko. WARN if allocating a root VMCS page fails, e.g. to help users figure out why VMX is broken in the unlikely scenario something goes sideways during boot (and because the allocation should succeed unless there's a kernel bug). Tweak KVM's error message to suggest checking kernel logs if VMX is unsupported (in addition to checking BIOS). Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: VMX: Unconditionally allocate root VMCSes during boot CPU bringupSean Christopherson
Allocate the root VMCS (misleading called "vmxarea" and "kvm_area" in KVM) for each possible CPU during early boot CPU bringup, before early TDX initialization, so that TDX can eventually do VMXON on-demand (to make SEAMCALLs) without needing to load kvm-intel.ko. Allocate the pages early on, e.g. instead of trying to do so on-demand, to avoid having to juggle allocation failures at runtime. Opportunistically rename the per-CPU pointers to better reflect the role of the VMCS. Use Intel's "root VMCS" terminology, e.g. from various VMCS patents[1][2] and older SDMs, not the more opaque "VMXON region" used in recent versions of the SDM. While it's possible the VMCS passed to VMXON no longer serves as _the_ root VMCS on modern CPUs, it is still in effect a "root mode VMCS", as described in the patents. Link: https://patentimages.storage.googleapis.com/c7/e4/32/d7a7def5580667/WO2013101191A1.pdf [1] Link: https://patentimages.storage.googleapis.com/13/f6/8d/1361fab8c33373/US20080163205A1.pdf [2] Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: x86: Move "kvm_rebooting" to kernel as "virt_rebooting"Sean Christopherson
Move "kvm_rebooting" to the kernel, exported for KVM, as one of many steps towards extracting the innermost VMXON and EFER.SVME management logic out of KVM and into to core x86. For lack of a better name, call the new file "hw.c", to yield "virt hardware" when combined with its parent directory. No functional change intended. Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: VMX: Move architectural "vmcs" and "vmcs_hdr" structures to public vmx.hSean Christopherson
Move "struct vmcs" and "struct vmcs_hdr" to asm/vmx.h in anticipation of moving VMXON/VMXOFF to the core kernel (VMXON requires a "root" VMCS with the appropriate revision ID in its header). No functional change intended. Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: x86: Move kvm_rebooting to x86Sean Christopherson
Move kvm_rebooting, which is only read by x86, to KVM x86 so that it can be moved again to core x86 code. Add a "shutdown" arch hook to facilate setting the flag in KVM x86, along with a pile of comments to provide more context around what KVM x86 is doing and why. Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04arm64: make runtime const not usable by modulesJisheng Zhang
Similar as commit 284922f4c563 ("x86: uaccess: don't use runtime-const rewriting in modules") does, make arm64's runtime const not usable by modules too, to "make sure this doesn't get forgotten the next time somebody wants to do runtime constant optimizations". The reason is well explained in the above commit: "The runtime-const infrastructure was never designed to handle the modular case, because the constant fixup is only done at boot time for core kernel code." Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
2026-03-04x86/entry/vdso32: Work around libgcc unwinder bugH. Peter Anvin
The unwinder code in libgcc has a long standing bug which causes it to fail to pick up the signal frame CFI flag. This is a generic bug across all platforms. It affects the __kernel_sigreturn and __kernel_rt_sigreturn vdso entry points on i386. The x86-64 kernel doesn't provide a sigreturn stub, and so there is no kernel-provided code that is affected on x86-64. libgcc does have a legacy fallback path which happens to work as long as the bytes immediately before each of the sigreturn functions fall outside any function. This patch adds a nop before the ALIGN to each of the sigreturn stubs to ensure that this is, indeed, the case. The rest of the patch is just a comment which documents the invariants that need to be maintained for this legacy path to work correctly. This is a manifest bug: in the current vdso, __kernel_vsyscall is a multiple of 16 bytes long and thus __kernel_sigreturn does not have any padding in front of it. Closes: https://lore.kernel.org/lkml/f3412cc3e8f66d1853cc9d572c0f2fab076872b1.camel@xry111.site Fixes: 884961618ee5 ("x86/entry/vdso32: Remove open-coded DWARF in sigreturn.S") Reported-by: Xi Ruoyao <xry111@xry111.site> Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124050 Link: https://patch.msgid.link/20260227010308.310342-1-hpa@zytor.com
2026-03-04x86/resctrl: Fix SNC detectionTony Luck
Now that the x86 topology code has a sensible nodes-per-package measure, that does not depend on the online status of CPUs, use this to divinate the SNC mode. Note that when Cluster on Die (CoD) is configured on older systems this will also show multiple NUMA nodes per package. Intel Resource Director Technology is incomaptible with CoD. Print a warning and do not use the fixup MSR_RMID_SNC_CONFIG. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Link: https://patch.msgid.link/aaCxbbgjL6OZ6VMd@agluck-desk3 Link: https://patch.msgid.link/20260303110100.367976706@infradead.org
2026-03-04x86/topo: Fix SNC topology messPeter Zijlstra
Per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode"), the original crazy SNC-3 SLIT table was: node distances: node 0 1 2 3 4 5 0: 10 15 17 21 28 26 1: 15 10 15 23 26 23 2: 17 15 10 26 23 21 3: 21 28 26 10 15 17 4: 23 26 23 15 10 15 5: 26 23 21 17 15 10 And per: https://lore.kernel.org/lkml/20250825075642.GQ3245006@noisy.programming.kicks-ass.net/ The suggestion was to average the off-trace clusters to restore sanity. However, 4d6dd05d07d0 implements this under various assumptions: - anything GNR/CWF with numa_in_package; - there will never be more than 2 packages; - the off-trace cluster will have distance >20 And then HPE shows up with a machine that matches the Vendor-Family-Model checks but looks like this: Here's an 8 socket (2 chassis) HPE system with SNC enabled: node 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0: 10 12 16 16 16 16 18 18 40 40 40 40 40 40 40 40 1: 12 10 16 16 16 16 18 18 40 40 40 40 40 40 40 40 2: 16 16 10 12 18 18 16 16 40 40 40 40 40 40 40 40 3: 16 16 12 10 18 18 16 16 40 40 40 40 40 40 40 40 4: 16 16 18 18 10 12 16 16 40 40 40 40 40 40 40 40 5: 16 16 18 18 12 10 16 16 40 40 40 40 40 40 40 40 6: 18 18 16 16 16 16 10 12 40 40 40 40 40 40 40 40 7: 18 18 16 16 16 16 12 10 40 40 40 40 40 40 40 40 8: 40 40 40 40 40 40 40 40 10 12 16 16 16 16 18 18 9: 40 40 40 40 40 40 40 40 12 10 16 16 16 16 18 18 10: 40 40 40 40 40 40 40 40 16 16 10 12 18 18 16 16 11: 40 40 40 40 40 40 40 40 16 16 12 10 18 18 16 16 12: 40 40 40 40 40 40 40 40 16 16 18 18 10 12 16 16 13: 40 40 40 40 40 40 40 40 16 16 18 18 12 10 16 16 14: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 10 12 15: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 12 10 10 = Same chassis and socket 12 = Same chassis and socket (SNC) 16 = Same chassis and adjacent socket 18 = Same chassis and non-adjacent socket 40 = Different chassis Turns out, the 'max 2 packages' thing is only relevant to the SNC-3 parts, the smaller parts do 8 sockets (like usual). The above SLIT table is sane, but violates the previous assumptions and trips a WARN. Now that the topology code has a sensible measure of nodes-per-package, we can use that to divinate the SNC mode at hand, and only fix up SNC-3 topologies. There is a 'healthy' amount of paranoia code validating the assumptions on the SLIT table, a simple pr_err(FW_BUG) print on failure and a fallback to using the regular table. Lets see how long this lasts :-) Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode") Reported-by: Kyle Meyer <kyle.meyer@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://patch.msgid.link/20260303110100.238361290@infradead.org
2026-03-04x86/topo: Replace x86_has_numa_in_packagePeter Zijlstra
.. with the brand spanking new topology_num_nodes_per_package(). Having the topology setup determine this value during MADT/SRAT parsing before SMP bringup avoids having to detect this situation when building the SMP topology masks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://patch.msgid.link/20260303110100.123701837@infradead.org
2026-03-04x86/topo: Add topology_num_nodes_per_package()Peter Zijlstra
Use the MADT and SRAT table data to compute __num_nodes_per_package. Specifically, SRAT has already been parsed in x86_numa_init(), which is called before acpi_boot_init() which parses MADT. So both are available in topology_init_possible_cpus(). This number is useful to divinate the various Intel CoD/SNC and AMD NPS modes, since the platforms are failing to provide this otherwise. Doing it this way is independent of the number of online CPUs and other such shenanigans. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://patch.msgid.link/20260303110100.004091624@infradead.org
2026-03-04x86/numa: Store extra copy of numa_nodes_parsedPeter Zijlstra
The topology setup code needs to know the total number of physical nodes enumerated in SRAT; however NUMA_EMU can cause the existing numa_nodes_parsed bitmap to be fictitious. Therefore, keep a copy of the bitmap specifically to retain the physical node count. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://patch.msgid.link/20260303110059.889884023@infradead.org
2026-03-04arm64: mm: Add PTE_DIRTY back to PAGE_KERNEL* to fix kexec/hibernationCatalin Marinas
Commit 143937ca51cc ("arm64, mm: avoid always making PTE dirty in pte_mkwrite()") changed pte_mkwrite_novma() to only clear PTE_RDONLY when PTE_DIRTY is set. This was to allow writable-clean PTEs for swap pages that haven't actually been written. However, this broke kexec and hibernation for some platforms. Both go through trans_pgd_create_copy() -> _copy_pte(), which calls pte_mkwrite_novma() to make the temporary linear-map copy fully writable. With the updated pte_mkwrite_novma(), read-only kernel pages (without PTE_DIRTY) remain read-only in the temporary mapping. While such behaviour is fine for user pages where hardware DBM or trapping will make them writeable, subsequent in-kernel writes by the kexec relocation code will fault. Add PTE_DIRTY back to all _PAGE_KERNEL* protection definitions. This was the case prior to 5.4, commit aa57157be69f ("arm64: Ensure VM_WRITE|VM_SHARED ptes are clean by default"). With the kernel linear-map PTEs always having PTE_DIRTY set, pte_mkwrite_novma() correctly clears PTE_RDONLY. Fixes: 143937ca51cc ("arm64, mm: avoid always making PTE dirty in pte_mkwrite()") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: stable@vger.kernel.org Reported-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com> Link: https://lore.kernel.org/r/20251204062722.3367201-1-jianpeng.chang.cn@windriver.com Cc: Will Deacon <will@kernel.org> Cc: Huang, Ying <ying.huang@linux.alibaba.com> Cc: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Signed-off-by: Will Deacon <will@kernel.org>
2026-03-04arm64: Silence sparse warnings caused by the type casting in (cmp)xchgCatalin Marinas
The arm64 xchg/cmpxchg() wrappers cast the arguments to (unsigned long) prior to invoking the static inline functions implementing the operation. Some restrictive type annotations (e.g. __bitwise) lead to sparse warnings like below: sparse warnings: (new ones prefixed by >>) fs/crypto/bio.c:67:17: sparse: sparse: cast from restricted blk_status_t >> fs/crypto/bio.c:67:17: sparse: sparse: cast to restricted blk_status_t Force the casting in the arm64 xchg/cmpxchg() wrappers to silence sparse. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602230947.uNRsPyBn-lkp@intel.com/ Link: https://lore.kernel.org/r/202602230947.uNRsPyBn-lkp@intel.com/ Cc: Will Deacon <will@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Will Deacon <will@kernel.org>
2026-03-04x86/boot: Handle relative CONFIG_EFI_SBAT_FILE file pathsJan Stancek
CONFIG_EFI_SBAT_FILE can be a relative path. When compiling using a different output directory (O=) the build currently fails because it can't find the filename set in CONFIG_EFI_SBAT_FILE: arch/x86/boot/compressed/sbat.S: Assembler messages: arch/x86/boot/compressed/sbat.S:6: Error: file not found: kernel.sbat Add $(srctree) as include dir for sbat.o. [ bp: Massage commit message. ] Fixes: 61b57d35396a ("x86/efi: Implement support for embedding SBAT data for x86") Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/f4eda155b0cef91d4d316b4e92f5771cb0aa7187.1772047658.git.jstancek@redhat.com
2026-03-04powerpc/crash: adjust the elfcorehdr sizeSourabh Jain
With crash hotplug support enabled, additional memory is allocated to the elfcorehdr kexec segment to accommodate resources added during memory hotplug events. However, the kdump FDT is not updated with the same size, which can result in elfcorehdr corruption in the kdump kernel. Update elf_headers_sz (the kimage member representing the size of the elfcorehdr kexec segment) to reflect the total memory allocated for the elfcorehdr segment instead of the elfcorehdr buffer size at the time of kdump load. This allows of_kexec_alloc_and_setup_fdt() to reserve the full elfcorehdr memory in the kdump FDT and prevents elfcorehdr corruption. Fixes: 849599b702ef8 ("powerpc/crash: add crash memory hotplug support") Reviewed-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20260227171801.2238847-1-sourabhjain@linux.ibm.com
2026-03-04powerpc/kexec/core: use big-endian types for crash variablesSourabh Jain
Use explicit word-sized big-endian types for kexec and crash related variables. This makes the endianness unambiguous and avoids type mismatches that trigger sparse warnings. The change addresses sparse warnings like below (seen on both 32-bit and 64-bit builds): CHECK ../arch/powerpc/kexec/core.c sparse: expected unsigned int static [addressable] [toplevel] [usertype] crashk_base sparse: got restricted __be32 [usertype] sparse: warning: incorrect type in assignment (different base types) sparse: expected unsigned int static [addressable] [toplevel] [usertype] crashk_size sparse: got restricted __be32 [usertype] sparse: warning: incorrect type in assignment (different base types) sparse: expected unsigned long long static [addressable] [toplevel] mem_limit sparse: got restricted __be32 [usertype] sparse: warning: incorrect type in assignment (different base types) sparse: expected unsigned int static [addressable] [toplevel] [usertype] kernel_end sparse: got restricted __be32 [usertype] No functional change intended. Fixes: ea961a828fe7 ("powerpc: Fix endian issues in kexec and crash dump code") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512221405.VHPKPjnp-lkp@intel.com/ Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251224151257.28672-1-sourabhjain@linux.ibm.com
2026-03-04powerpc/prom_init: Fixup missing #size-cells on PowerMac media-bay nodesRob Herring (Arm)
Similar to other PowerMac mac-io devices, the media-bay node is missing the "#size-cells" property. Depends-on: commit 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling") Reported-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251029174047.1620073-1-robh@kernel.org
2026-03-04powerpc: dts: fsl: Drop unused .dtsi filesRob Herring (Arm)
These files are not included by anything and therefore don't get built or tested. There's also no upstream driver for the interlaken-lac stuff. Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20260128140222.1627203-1-robh@kernel.org
2026-03-04powerpc/uaccess: Fix inline assembly for clang build on PPC32Christophe Leroy (CS GROUP)
Test robot reports the following error with clang-16.0.6: In file included from kernel/rseq.c:75: include/linux/rseq_entry.h:141:3: error: invalid operand for instruction unsafe_get_user(offset, &ucs->post_commit_offset, efault); ^ include/linux/uaccess.h:608:2: note: expanded from macro 'unsafe_get_user' arch_unsafe_get_user(x, ptr, local_label); \ ^ arch/powerpc/include/asm/uaccess.h:518:2: note: expanded from macro 'arch_unsafe_get_user' __get_user_size_goto(__gu_val, __gu_addr, sizeof(*(p)), e); \ ^ arch/powerpc/include/asm/uaccess.h:284:2: note: expanded from macro '__get_user_size_goto' __get_user_size_allowed(x, ptr, size, __gus_retval); \ ^ arch/powerpc/include/asm/uaccess.h:275:10: note: expanded from macro '__get_user_size_allowed' case 8: __get_user_asm2(x, (u64 __user *)ptr, retval); break; \ ^ arch/powerpc/include/asm/uaccess.h:258:4: note: expanded from macro '__get_user_asm2' " li %1+1,0\n" \ ^ <inline asm>:7:5: note: instantiated into assembly here li 31+1,0 ^ 1 error generated. On PPC32, for 64 bits vars a pair of registers is used. Usually the lower register in the pair is the high part and the higher register is the low part. GCC uses r3/r4 ... r11/r12 ... r14/r15 ... r30/r31 In older kernel code inline assembly was using %1 and %1+1 to represent 64 bits values. However here it looks like clang uses r31 as high part, allthough r32 doesn't exist hence the error. Allthoug %1+1 should work, most places now use %L1 instead of %1+1, so let's do the same here. With that change, the build doesn't fail anymore and a disassembly shows clang uses r17/r18 and r31/r14 pair when GCC would have used r16/r17 and r30/r31: Disassembly of section .fixup: 00000000 <.fixup>: 0: 38 a0 ff f2 li r5,-14 4: 3a 20 00 00 li r17,0 8: 3a 40 00 00 li r18,0 c: 48 00 00 00 b c <.fixup+0xc> c: R_PPC_REL24 .text+0xbc 10: 38 a0 ff f2 li r5,-14 14: 3b e0 00 00 li r31,0 18: 39 c0 00 00 li r14,0 1c: 48 00 00 00 b 1c <.fixup+0x1c> 1c: R_PPC_REL24 .text+0x144 Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202602021825.otcItxGi-lkp@intel.com/ Fixes: c20beffeec3c ("powerpc/uaccess: Use flexible addressing with __put_user()/__get_user()") Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Acked-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/8ca3a657a650e497a96bfe7acde2f637dadab344.1770103646.git.chleroy@kernel.org
2026-03-04powerpc/e500: Always use 64 bits PTEChristophe Leroy
Today there are two PTE formats for e500: - The 64 bits format, used - On 64 bits kernel - On 32 bits kernel with 64 bits physical addresses - On 32 bits kernel with support of huge pages - The 32 bits format, used in other cases Maintaining two PTE formats means unnecessary maintenance burden because every change needs to be implemented and tested for both formats. Remove the 32 bits PTE format. The memory usage increase due to larger PTEs is minimal (approx. 0,1% of memory). This also means that from now on huge pages are supported also with 32 bits physical addresses. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/04a658209ea78dcc0f3dbde6b2c29cf1939adfe9.1767721208.git.chleroy@kernel.org
2026-03-03KVM/TDX: Rename KVM_SUPPORTED_TD_ATTRS to KVM_SUPPORTED_TDX_TD_ATTRSXiaoyao Li
Rename KVM_SUPPORTED_TD_ATTRS to KVM_SUPPORTED_TDX_TD_ATTRS to include "TDX" in the name, making it clear that it pertains to TDX. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260303030335.766779-5-xiaoyao.li@intel.com
2026-03-03x86/tdx: Rename TDX_ATTR_* to TDX_TD_ATTR_*Xiaoyao Li
The macros TDX_ATTR_* and DEF_TDX_ATTR_* are related to TD attributes, which are TD-scope attributes. Naming them as TDX_ATTR_* can be somewhat confusing and might mislead people into thinking they are TDX global things. Rename TDX_ATTR_* to TDX_TD_ATTR_* to explicitly clarify they are TD-scope things. Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260303030335.766779-4-xiaoyao.li@intel.com
2026-03-03KVM/TDX: Remove redundant definitions of TDX_TD_ATTR_*Xiaoyao Li
There are definitions of TD attributes bits inside asm/shared/tdx.h as TDX_ATTR_*. Remove KVM's definitions and use the ones in asm/shared/tdx.h Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260303030335.766779-3-xiaoyao.li@intel.com
2026-03-03x86/tdx: Fix the typo in TDX_ATTR_MIGRTABLEXiaoyao Li
The TD scoped TDCS attributes are defined by bit positions. In the guest side of the TDX code, the 'tdx_attributes' string array holds pretty print names for these attributes, which are generated via macros and defines. Today these pretty print names are only used to print the attribute names to dmesg. Unfortunately there is a typo in the define for the migratable bit. Change the defines TDX_ATTR_MIGRTABLE* to TDX_ATTR_MIGRATABLE*. Update the sole user, the tdx_attributes array, to use the fixed name. Since these defines control the string printed to dmesg, the change is user visible. But the risk of breakage is almost zero since it is not exposed in any interface expected to be consumed programmatically. Fixes: 564ea84c8c14 ("x86/tdx: Dump attributes and TD_CTLS on boot") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260303030335.766779-2-xiaoyao.li@intel.com
2026-03-03KVM: x86: Drop redundant call to kvm_deliver_exception_payload()Yosry Ahmed
In kvm_check_and_inject_events(), kvm_deliver_exception_payload() is called for pending #DB exceptions. However, shortly after, the per-vendor inject_exception callbacks are made. Both vmx_inject_exception() and svm_inject_exception() unconditionally call kvm_deliver_exception_payload(), so the call in kvm_check_and_inject_events() is redundant. Note that the extra call for pending #DB exceptions is harmless, as kvm_deliver_exception_payload() clears exception.has_payload after the first call. The call in kvm_check_and_inject_events() was added in commit f10c729ff965 ("kvm: vmx: Defer setting of DR6 until #DB delivery"). At that point, the call was likely needed because svm_queue_exception() checked whether an exception for L2 is intercepted by L1 before calling kvm_deliver_exception_payload(), as SVM did not have a check_nested_events callback. Since DR6 is updated before the #DB intercept in SVM (unlike VMX), it was necessary to deliver the DR6 payload before calling svm_queue_exception(). After that, commit 7c86663b68ba ("KVM: nSVM: inject exceptions via svm_check_nested_events") added a check_nested_events callback for SVM, which checked for L1 intercepts for L2's exceptions, and delivered the the payload appropriately before the intercept. At that point, svm_queue_exception() started calling kvm_deliver_exception_payload() unconditionally, and the call to kvm_deliver_exception_payload() from its caller became redundant. No functional change intended. Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260302154249.784529-1-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-03KVM: SVM: Skip OSVW MSR reads if current CPU doesn't support the featureSean Christopherson
Skip the OSVW RDMSRs if the current CPU doesn't enumerate support for the MSRs. In practice, checking only the boot CPU's capabilities is sufficient, as the RDMSRs should fault when unsupported, but there's no downside to being more precise, and checking only the boot CPU _looks_ wrong given the rather odd semantics of the MSRs. E.g. if a CPU doesn't support OVSW, then KVM must assume all errata are present. Link: https://patch.msgid.link/20251113231420.1695919-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-03KVM: SVM: Skip OSVW variable updates if current CPU's errata are a subsetSean Christopherson
Elide the OSVW variable updates if the current CPU's set of errata are a subset of the errata tracked in the global values, i.e. if no update is needed. There's no danger of under-reporting errata due to bailing early as KVM is purely reducing the set of "known fixed" errata. I.e. a racing update on a different CPU with _more_ errata doesn't change anything if the current CPU has the same or fewer errata relative to the status quo. If another CPU is writing osvw_len, then "len" is guaranteed to be larger than the new osvw_len and so the osvw_len update would be skipped anyways. If another CPU is setting new bits in osvw_status, then "status" is guaranteed to be a subset of the new osvw_status and the bitwise-OR would be an effective nop anyways. Link: https://patch.msgid.link/20251113231420.1695919-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-03KVM: SVM: Extract OS-visible workarounds setup to helper functionSean Christopherson
Move the initialization of the global OSVW variables to a helper function so that svm_enable_virtualization_cpu() isn't polluted with a pile of what is effectively legacy code. No functional change intended. Link: https://patch.msgid.link/20251113231420.1695919-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-03KVM: SVM: Skip OSVW MSR reads if KVM is treating all errata as presentSean Christopherson
Don't bother reading the OSVW MSRs if osvw_len is already zero, i.e. if KVM is already treating all errata as present, in which case the positive path of the if-statement is one giant nop. Opportunistically update the comment to more thoroughly explain how the MSRs work and why the code does what it does. Link: https://patch.msgid.link/20251113231420.1695919-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-03KVM: SVM: Serialize updates to global OS-Visible Workarounds variablesSean Christopherson
Guard writes to the global osvw_status and osvw_len variables with a spinlock to ensure enabling virtualization on multiple CPUs in parallel doesn't effectively drop any writes due to writing back stale data. Don't bother taking the lock when the boot CPU doesn't support the feature, as that check is constant for all CPUs, i.e. racing writes will always write the same value (zero). Note, the bug was inadvertently "fixed" by commit 9a798b1337af ("KVM: Register cpuhp and syscore callbacks when enabling hardware"), which effectively serialized calls to enable virtualization due to how the cpuhp framework "brings up" CPU. But KVM shouldn't rely on the mechanics of cphup to provide serialization. Link: https://patch.msgid.link/20251113231420.1695919-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>