| Age | Commit message | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
- The FF-A firmware driver gets 11 individual bugfixes for a number of
issues with robustness to buggy firmware or client implementations.
Another firmware fix addresses suspend to RAM via PSCI firmware.
- The final code change is for the old Arm Integrator reference
platform that recently started exposing an old NULL pointer
dereference bug.
- The MAINTAINERS file gets two updates, notably James Tai and Yu-Chun
Lin are stepping up as co-maintainers for the Realtek platform.
- The remaining patches are all for devicetree files. Two of these are
for riscv boards, the rest are all for Renesas Arm platforms,
addressing build time checking issues as well as minor configuration
problems.
* tag 'soc-fixes-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (30 commits)
firmware: psci: Set pm_set_resume/suspend_via_firmware() for SYSTEM_SUSPEND
ARM: realtek: MAINTAINERS: Include pin controller drivers
MAINTAINERS: Add maintainers for ARM/REALTEK ARCHITECTURE
ARM: integrator: Fix early initialization
firmware: arm_ffa: Fix sched-recv callback partition lookup
firmware: arm_ffa: Snapshot notifier callbacks under lock
firmware: arm_ffa: Align RxTx buffer size before mapping
firmware: arm_ffa: Validate framework notification message layout
firmware: arm_ffa: Keep framework RX release under lock
firmware: arm_ffa: Bound PARTITION_INFO_GET_REGS copies
firmware: arm_ffa: Unregister bus notifier on teardown for FF-A v1.0
firmware: arm_ffa: Fix per-vcpu self notifications handling in workqueue
firmware: arm_ffa: Avoid collapsing NPI work from different CPUs
firmware: arm_ffa: Skip free_pages on RX buffer alloc failure
firmware: arm_ffa: Check for NULL FF-A ID table while driver registration
riscv: dts: microchip: fix icicle i2c pinctrl configuration
riscv: dts: starfive: jh7110: Drop CAMSS node
arm64: dts: renesas: r9a09g056: Add #mux-state-cells to usb20phyrst
arm64: dts: renesas: r9a09g057: Add #mux-state-cells to usb2{0,1}phyrst
ARM: dts: renesas: rskrza1: Drop superfluous cells
...
|
|
This reverts commit 91f3a27ae9f66d81a5906461762c37c8a2bcab06.
Contrary to the assumption stated with the original commit description
this driver is in use and I'm going to maintain it for the foreseeable
future.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Link: https://patch.msgid.link/alpine.DEB.2.21.2605201204260.1450@angie.orcam.me.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A typo in the config guard in __hyp_do_panic broke the stage-2 disabling
and made backtraces for pKVM quite unreliable.
Fix that typo.
Fixes: 9019e82c7e46 ("KVM: arm64: Add PKVM_DISABLE_STAGE2_ON_PANIC")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260520220830.273289-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
After commit feee6b2989165631b1 ("mm/memory_hotplug: shrink zones when
offlining memory"), __remove_pages() doesn't need the "zone" parameter
so the "page" variable is also unused. Remove the unused code to avoid
the following build warning:
arch/loongarch/mm/init.c: In function 'arch_remove_memory':
arch/loongarch/mm/init.c:134:22: warning: variable 'page' set but not used [-Wunused-but-set-variable=]
134 | struct page *page = pfn_to_page(start_pfn);
Cc: <stable@vger.kernel.org>
Reviewed-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Validate the relocation address against the initrd region specified via
"initrd=" or "initrdmem=" on the command line. Reject relocation targets
that overlap the initrd to prevent memory corruption during early boot.
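The overlap test described above can be sketched as a plain interval check. This is only an illustration: the helper names and the half-open range convention are assumptions, not taken from the patch, and address overflow is ignored for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: both ranges are half-open, [start, start + size). */
static bool ranges_overlap(uint64_t a_start, uint64_t a_size,
			   uint64_t b_start, uint64_t b_size)
{
	/* Two ranges overlap when each one starts before the other ends. */
	return a_start < b_start + b_size && b_start < a_start + a_size;
}

/* A relocation target is acceptable only if it stays clear of the initrd. */
static bool reloc_target_ok(uint64_t reloc_start, uint64_t kernel_size,
			    uint64_t initrd_start, uint64_t initrd_size)
{
	if (initrd_size == 0)	/* no initrd region was specified */
		return true;
	return !ranges_overlap(reloc_start, kernel_size,
			       initrd_start, initrd_size);
}
```

A target that ends exactly where the initrd begins is accepted, since the ranges are treated as half-open.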
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: WANG Rui <wangrui@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
When the kernel is relocated during early boot (efistub or kexec_file),
a randomized load address may have already been selected and applied. In
this case, performing KASLR again in relocate.c is unnecessary.
Note: strictly-defined KASLR means the kernel's final runtime address
has a random offset from the kernel's load address, which is implemented
in relocate.c; broadly-defined KASLR means the kernel's final runtime
address has a random offset from the kernel's link address (a.k.a.
VMLINUX_LOAD_ADDRESS), which also includes the efistub implementation,
kexec_file implementation and QEMU direct kernel boot. kaslr_disabled()
returning true only means strictly-defined KASLR is disabled.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: WANG Rui <wangrui@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Introduce efi_get_kimg_kaslr_address() helper to compute the preferred
kernel image load address dynamically when CONFIG_RANDOMIZE_BASE is
enabled. The function derives a random offset by using the EFI-provided
randomness combined with the timer tick value, and constrains it within
CONFIG_RANDOMIZE_BASE_MAX_OFFSET.
Update EFI_KIMG_PREFERRED_ADDRESS to call this helper so that the EFI
stub can select a randomized load address when KASLR is active, while
preserving the original base address behavior when KASLR is disabled or
"nokaslr" is specified.
Note: LoongArch can't use KASLR with hibernation, so set efi_nokaslr to true
if "resume=<devname>" is explicitly specified in cmdline.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: WANG Rui <wangrui@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Cc: Will Deacon <will@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Booting with "nopcid" clears X86_FEATURE_PCID and keeps CR4.PCIDE from being
set to one. On AMD CPUs that support INVLPGB, broadcast TLB flushing remains
enabled.
There are two checks that decide whether the global ASID code runs,
mm_global_asid() and consider_global_asid(), that key off of the
X86_FEATURE_INVLPGB feature. Once an mm becomes active on more than three
CPUs, consider_global_asid() assigns it a global ASID, after which
flush_tlb_mm_range() takes the broadcast_tlb_flush() path using a non-zero
PCID. Issuing an INVLPGB with a non-zero PCID while CR4.PCIDE is not set
results in a #GP:
Oops: general protection fault, kernel NULL pointer dereference 0x1: 0000 [#1] SMP NOPTI
CPU: 158 UID: 0 PID: 3119 Comm: snap Not tainted 7.1.0-rc3 #1 PREEMPT(full)
Hardware name: ...
RIP: 0010:broadcast_tlb_flush
Code: ... 89 da 48 83 c8 07 <0f> 01 fe eb 08 cc cc cc ...
Call Trace:
<TASK>
flush_tlb_mm_range
ptep_clear_flush
wp_page_copy
? _raw_spin_unlock
__handle_mm_fault
handle_mm_fault
do_user_addr_fault
exc_page_fault
asm_exc_page_fault
All processors that support broadcast TLB invalidation also have PCID support,
so it is only the "nopcid" scenario that is of concern. In this situation just
disable the broadcast TLB support using the CPUID dependency support by making
X86_FEATURE_INVLPGB dependent on X86_FEATURE_PCID.
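As a rough illustration of how a feature dependency table produces this effect, the sketch below clears every feature that depends on a cleared one. The enum, the table layout and the function name are hypothetical stand-ins and do not reproduce the kernel's CPUID dependency code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative feature identifiers, not the kernel's X86_FEATURE_* values. */
enum feature { FEAT_PCID, FEAT_INVLPGB, FEAT_COUNT };

struct feature_dep {
	enum feature feature;		/* this feature ...           */
	enum feature depends_on;	/* ... requires this one      */
};

/* The dependency added by the fix: INVLPGB requires PCID. */
static const struct feature_dep deps[] = {
	{ FEAT_INVLPGB, FEAT_PCID },
};

/* Clear a feature and, recursively, everything that depends on it.
 * Assumes the table is acyclic. */
static void clear_feature(bool caps[FEAT_COUNT], enum feature f)
{
	caps[f] = false;
	for (size_t i = 0; i < sizeof(deps) / sizeof(deps[0]); i++)
		if (deps[i].depends_on == f)
			clear_feature(caps, deps[i].feature);
}
```

With such a table, clearing the PCID capability (as "nopcid" does) also clears INVLPGB, so the broadcast flush path is never selected.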
[ bp: Massage commit message. ]
Fixes: 4afeb0ed1753 ("x86/mm: Enable broadcast TLB invalidation for multi-threaded processes")
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Assisted-by: Claude:claude-opus-4.7
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/b915acfd63e8b2a094fdeb8dc608738072518764.1779296450.git.thomas.lendacky@amd.com
|
|
Start the numbering scheme for higher-level topology structures (like
socket, book, drawer) at zero, matching the convention for other hardware
identifiers such as CPU numbers.
Hardware documentation, the Hardware Management Console and other tools
like zmemtopo also use zero-based numbering for these containing entities.
Aligning the numbering in sysfs, procfs, and tools like lscpu improves
user experience by making it easier to correlate topology information
across different interfaces.
If available, Linux on s390 derives this physical topology information from
the stsi function code 15 store_topology instruction, which is defined to
start at 1 for the lowest numbered container id. Subtract one, so
drawer_id, book_id and socket_id in cpu_topology[] start with 0 for the
lowest numbered entity; and /proc/cpuinfo and tools like 'lscpu -ye'
display the expected values.
Display only, no functional change intended.
Example: In a partition with 3 cores in a system with
8 cores per socket; 2 sockets per book; 4 books per drawer; and 4 drawers:
Before this fix:
$ lscpu -ye
CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
0 0 2 4 1 0 0:0:0 yes yes vert-high 0
1 0 2 4 1 0 1:1:1 yes yes vert-high 1
2 0 2 4 1 1 2:2:2 yes yes vert-medium 2
3 0 2 4 1 1 3:3:3 yes yes vert-medium 3
4 0 2 4 2 3 4:4:4 yes yes vert-low 4
5 0 2 4 2 3 5:5:5 yes yes vert-low 5
After this fix:
$ lscpu -ye
CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
0 0 1 3 0 0 0:0:0 yes yes vert-high 0
1 0 1 3 0 0 1:1:1 yes yes vert-high 1
2 0 1 3 0 1 2:2:2 yes yes vert-medium 2
3 0 1 3 0 1 3:3:3 yes yes vert-medium 3
4 0 1 3 1 3 4:4:4 yes yes vert-low 4
5 0 1 3 1 3 5:5:5 yes yes vert-low 5
For KVM guests, qemu emulates the stsi FC15 store_topology instruction.
This emulation currently erroneously starts id numbering at 0. A qemu fix
is proposed that makes this emulation compliant to the stsi architecture.
In case a guest with this patch is running on a qemu without the other fix,
it can happen that ids of 255 are displayed erroneously.
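One plausible way to read the 255 is unsigned wraparound: if the ids are held in an unsigned 8-bit field (an assumption here, not stated above), subtracting one from an id that already starts at 0 wraps around. A minimal sketch with a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Convert a 1-based container id, as defined for the store_topology
 * data, to a 0-based id. The uint8_t storage type is an assumption
 * made for this illustration.
 */
static uint8_t container_id_to_zero_based(uint8_t stsi_id)
{
	/* An input of 0 (non-compliant emulation) wraps around to 255. */
	return (uint8_t)(stsi_id - 1);
}
```

For compliant input the lowest container id 1 maps to 0, while an id of 0 from a non-compliant emulation would show up as 255.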
z/VM currently does not provide or emulate physical topology information to
its guests. So this patch does not change anything for z/VM guests.
Fixes: 10d385895055 ("[S390] topology: expose core identifier")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
pKVM must validate the host-provided tracing buffer descriptor.
However, if an error is found, the hypervisor would just return 0 to the
host. Fix the return value on validation failure.
While at it, rename the function to hyp_trace_desc_is_valid() and skip
validation for the nVHE mode as we trust host-provided data in that
case.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Fixes: 680a04c333fa ("KVM: arm64: Add tracing capability for the nVHE/pKVM hyp")
Link: https://lore.kernel.org/r/20260514162624.3477857-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Companion to commit 250f25367b58 ("KVM: arm64: Tear down vGIC on
failed vCPU creation"), which added the missing kvm_vgic_vcpu_destroy()
call to the kvm_share_hyp() failure path in kvm_arch_vcpu_create(). The
kvm_vgic_vcpu_init() failure path immediately above it has the same
shape and still needs the same cleanup.
Call kvm_vgic_vcpu_destroy() when kvm_vgic_vcpu_init() fails so private
IRQs allocated before a redistributor iodev registration failure are
released before the failed vCPU is freed.
Fixes: 03b3d00a70b5 ("KVM: arm64: vgic: Allocate private interrupts on demand")
Cc: stable@vger.kernel.org
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://lore.kernel.org/r/20260519135042.2219239-1-michael.bommarito@gmail.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Userspace can restore an ITS Device Table Entry whose Size field encodes
more EventID bits than the virtual ITS supports. The live MAPD path
rejects that state, but vgic_its_restore_dte() accepts it and stores the
out-of-range value in dev->num_eventid_bits.
Reject restored DTEs with num_eventid_bits > VITS_TYPER_IDBITS before
allocating the device. This mirrors the MAPD check and prevents the
restored state from reaching vgic_its_restore_itt(), where the unchecked
value can be converted into an oversized scan_its_table() range.
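The check amounts to a simple bound test before the device is allocated. The sketch below is illustrative only: the constant is a stand-in for VITS_TYPER_IDBITS with an assumed value of 16, and the function name and the -EINVAL return code are assumptions rather than details of the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for VITS_TYPER_IDBITS; the value 16 is assumed here. */
#define EXAMPLE_MAX_EVENTID_BITS 16

/* Hypothetical helper: validate the EventID width from a restored DTE. */
static int check_restored_eventid_bits(uint8_t num_eventid_bits)
{
	/* Same bound the live MAPD path is described as enforcing. */
	if (num_eventid_bits > EXAMPLE_MAX_EVENTID_BITS)
		return -EINVAL;
	return 0;
}
```

Rejecting the value early keeps any later table-size computation derived from it within the range the virtual ITS supports.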
Fixes: 57a9a117154c ("KVM: arm64: vgic-its: Device table save/restore")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://lore.kernel.org/r/20260519132519.2142458-1-michael.bommarito@gmail.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
|
|
Vishal reported that KVM unit test 'x2apic' started failing after commit
0e98eb14814e ("entry: Prepare for deferred hrtimer rearming").
The reason is that KVM/VMX is injecting interrupts while it has interrupts
disabled, for a context that will enable interrupts. This means that
regs->flags.X86_EFLAGS_IF == 0 and irqentry_exit() will not do the right
thing.
Notably, irqentry_exit() must not call hrtimer_rearm_deferred() when the return
context does not have IF set, because this will cause problems vs NMIs.
Therefore, fix up the state after the injection.
Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming")
Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Suggested-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260423155936.957351833@infradead.org
Closes: https://lore.kernel.org/r/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel%40intel.com
|
|
Move the VMX interrupt dispatch magic into the x86 core code. This
isolates KVM from the FRED/IDT decisions and reduces the amount of
EXPORT_SYMBOL_FOR_KVM().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com>
Tested-by: Zhao Liu <zhao1.liu@intel.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"14 hotfixes. 9 are for MM. 10 are cc:stable and the remainder are for
post-7.1 issues or aren't deemed suitable for backporting.
There's a two-patch MAINTAINERS series from Mike Rapoport which
updates us for the new KEXEC/KDUMP/crash/LUO/etc arrangements. And
another two-patch series from Muchun Song to fix a couple of
memory-hotplug issues. Otherwise singletons, please see the changelogs
for details"
* tag 'mm-hotfixes-stable-2026-05-18-21-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/memory: fix spurious warning when unmapping device-private/exclusive pages
mm: fix __vm_normal_page() to handle missing support for pmd_special()/pud_special()
drivers/base/memory: fix memory block reference leak in poison accounting
mm/memory_hotplug: fix memory block reference leak on remove
lib: kunit_iov_iter: fix test fail on powerpc
mm/page_alloc: fix initialization of tags of the huge zero folio with init_on_free
MAINTAINERS: add kexec@ list to LIVE UPDATE ENTRY
MAINTAINERS: add tree for KDUMP and KEXEC
selftests/mm: run_vmtests.sh: fix destructive tests invocation
scripts/gdb: slab: update field names of struct kmem_cache
scripts/gdb: mm: cast untyped symbols in x86_page_ops
mm/damon: fix damos_stat tracepoint format for sz_applied
mm/damon/sysfs-schemes: call missing mem_cgroup_iter_break()
mm/migrate_device: fix spinlock leak in migrate_vma_insert_huge_pmd_page
|
|
In map_vdso(), if a failure occurs during the installation of the VVAR
mappings, the error path attempts to clean up previously allocated mappings
using do_munmap(). However, the cleanup for the VVAR mapping is incorrectly
using image->size (the size of the vDSO text) instead of the actual size
allocated for the VVAR area.
Replace the incorrect do_munmap() image->size parameter with the constant
VDSO_NR_PAGES * PAGE_SIZE. Ensure the unmap size exactly matches the size
used during the vdso_install_vvar_mapping() phase to provide a symmetrical
and complete teardown of the memory region.
Fixes: e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping")
Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/20260503191609.551817-1-trintaeoitogc@gmail.com
|
|
BC.cond instructions introduced by FEAT_HBC cannot be executed
out-of-line, like other branch instructions. However, they can be
simulated in the same way as B.cond instructions.
Extend the B.cond decoder mask to match BC.cond instructions as well,
and handle them using the existing B.cond simulation path.
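In the A64 encoding, B.cond and BC.cond share the top byte 0x54 and differ only in bit 4 (0 for B.cond, 1 for BC.cond). Under that reading, widening the decoder amounts to dropping bit 4 from the match mask. The mask values and function names below are an illustrative sketch, not copied from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_BCOND_VALUE		0x54000000u
#define EX_BCOND_MASK_STRICT	0xff000010u	/* also requires bit 4 == 0 */
#define EX_BCOND_MASK_WIDE	0xff000000u	/* bit 4 may be 0 or 1      */

/* Matches B.cond only. */
static bool is_b_cond(uint32_t insn)
{
	return (insn & EX_BCOND_MASK_STRICT) == EX_BCOND_VALUE;
}

/* Matches both B.cond and BC.cond. */
static bool is_b_cond_or_bc_cond(uint32_t insn)
{
	return (insn & EX_BCOND_MASK_WIDE) == EX_BCOND_VALUE;
}
```

For example, 0x54000020 (B.EQ with imm19 = 1) matches both predicates, while 0x54000030 (the same encoding with bit 4 set, i.e. BC.EQ) matches only the wide one.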
Fixes: 7f86d128e437 ("arm64: add HWCAP for FEAT_HBC (hinted conditional branches)")
Cc: <stable@vger.kernel.org>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The POWER_SEQUENCING_PCIE_M2 driver handles power supply to the PCIe M.2
connectors and is required on a wide variety of ARM64 platforms such as
Qcom Snapdragon X Elite laptops and Mediatek Dojo Chromebooks.
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260514065017.11305-1-manivannan.sadhasivam@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
The kvm_riscv_vcpu_mmio_return() function handles MMIO read results
by writing the data back to the guest register. For signed load
instructions (LB, LH, LW on RV64), the value needs sign-extension
from a smaller integer to unsigned long.
The current code uses:
(ulong)data << shift >> shift
but (ulong) makes the right shift a logical shift (zero-extend)
rather than an arithmetic shift (sign-extend), causing incorrect
results when the MMIO device returns a negative value. For example,
LB reading 0x80 would return 128 instead of -128.
Fix this by casting to (long) after the left shift so that the
subsequent right shift is arithmetic and correctly propagates
the sign bit:
(long)((ulong)data << shift) >> shift
Additionally, remove the unnecessary shift assignment for LBU
(unsigned byte load) since it does not need sign extension.
This makes LBU consistent with LHU and LWU which already keep
shift = 0.
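The difference between the two expressions can be reproduced in a few lines of standalone C. This sketch uses fixed 64-bit types in place of ulong/long and hypothetical function names. It relies on the compiler performing an arithmetic right shift on negative signed values, which ISO C leaves implementation-defined but GCC and Clang provide.

```c
#include <stdint.h>

/* Old form: the unsigned right shift zero-extends the value. */
static uint64_t extend_buggy(uint64_t data, unsigned int shift)
{
	return data << shift >> shift;
}

/* Fixed form: cast to signed after the left shift so that the
 * right shift replicates the sign bit. */
static int64_t extend_fixed(uint64_t data, unsigned int shift)
{
	return (int64_t)(data << shift) >> shift;
}
```

For an LB-style load of 0x80 with shift = 56, extend_buggy() yields 128 while extend_fixed() yields -128. Positive values such as 0x7f are unaffected.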
Fixes: b91f0e4cb8a3 ("RISC-V: KVM: Factor-out instruction emulation into separate sources")
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Assisted-by: OpenClaw:DeepSeek-V3.2
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260514081752.472987-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
The SBI v0.1 SEND_IPI handler iterates over the hart mask and calls
kvm_get_vcpu_by_id() to find the target vcpu for each set bit. When a
guest provides a hart mask containing bits for non-existent vcpu_ids,
kvm_get_vcpu_by_id() returns NULL, which is then unconditionally
dereferenced by kvm_riscv_vcpu_set_interrupt(), causing a kernel crash.
Fix this by adding a NULL check before dereferencing the return value.
If the target vcpu is not found, skip it and continue processing the
remaining valid harts.
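The pattern is the usual "look up, skip if missing" loop over a bit mask. The following is a self-contained sketch; the struct, the lookup helper and the delivery flag are hypothetical stand-ins for the KVM types and functions named above.

```c
#include <stddef.h>

/* Hypothetical stand-in for a vcpu. */
struct ex_vcpu {
	int id;
	int pending_ipi;
};

/* Returns NULL when no vcpu has the requested id. */
static struct ex_vcpu *ex_find_vcpu(struct ex_vcpu *vcpus, size_t n, int id)
{
	for (size_t i = 0; i < n; i++)
		if (vcpus[i].id == id)
			return &vcpus[i];
	return NULL;
}

/* Walk the hart mask and return how many IPIs were delivered. */
static int ex_send_ipi(struct ex_vcpu *vcpus, size_t n, unsigned long hart_mask)
{
	int delivered = 0;

	for (int hart = 0; hart_mask; hart++, hart_mask >>= 1) {
		struct ex_vcpu *target;

		if (!(hart_mask & 1))
			continue;
		target = ex_find_vcpu(vcpus, n, hart);
		if (!target)	/* non-existent hart: skip, do not dereference */
			continue;
		target->pending_ipi = 1;
		delivered++;
	}
	return delivered;
}
```

With two vcpus (ids 0 and 1) and a mask of 0x7, the bit for hart 2 is ignored and the two valid harts still receive their interrupt.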
Fixes: a046c2d8578c ("RISC-V: KVM: Reorganize SBI code by moving SBI v0.1 to its own file")
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Assisted-by: OpenClaw:DeepSeek-V3.2
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517124414.420919-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
kvm_riscv_vcpu_pmu_event_info() returned -ENOMEM from the
SBI extension handler, which caused kvm_riscv_vcpu_sbi_ecall()
to abort KVM_RUN and surface the error to userspace instead of
completing the ECALL with a negative SBI error in a0.
Use SBI_ERR_FAILURE and the normal retdata path, matching other PMU
handlers and the kvm_sbi_ext_pmu_handler comment.
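The distinction is between the handler's own return value, which decides whether KVM_RUN is aborted, and the SBI error reported to the guest. A minimal sketch, using a hypothetical return-data struct and handler name; SBI_ERR_FAILURE is -1 per the SBI specification:

```c
#include <stdbool.h>

#define EX_SBI_SUCCESS		0
#define EX_SBI_ERR_FAILURE	(-1)

/* Hypothetical stand-in for the handler's return-data structure. */
struct ex_sbi_return {
	long err_val;		/* ends up in the guest's a0 */
	unsigned long out_val;	/* ends up in the guest's a1 */
};

/*
 * Returns 0 so that the ECALL completes normally. An allocation
 * failure is reported to the guest through err_val rather than
 * being propagated to userspace as a negative errno.
 */
static int ex_pmu_event_info(bool alloc_ok, struct ex_sbi_return *retdata)
{
	if (!alloc_ok) {
		retdata->err_val = EX_SBI_ERR_FAILURE;
		return 0;
	}
	retdata->err_val = EX_SBI_SUCCESS;
	return 0;
}
```

In this shape a failed allocation is visible to the guest as an SBI error while the vcpu keeps running.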
Fixes: e309fd113b9f ("RISC-V: KVM: Implement get event info function")
Cc: stable@vger.kernel.org
Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260514173642.41448-2-osama.abdelkader@gmail.com
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
kvm_riscv_vcpu_pmu_snapshot_set_shmem() returned -ENOMEM from the
SBI extension handler, which caused kvm_riscv_vcpu_sbi_ecall() to
abort KVM_RUN and surface the error to userspace instead of
completing the ECALL with a negative SBI error in a0.
Use SBI_ERR_FAILURE and the normal retdata path, matching other PMU
handlers and the kvm_sbi_ext_pmu_handler comment.
Fixes: c2f41ddbcdd7 ("RISC-V: KVM: Implement SBI PMU Snapshot feature")
Cc: stable@vger.kernel.org
Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260514173642.41448-1-osama.abdelkader@gmail.com
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
kvm_riscv_vcpu_record_steal_time() assumes that the steal-time shared
memory GPA (vcpu->arch.sta.shmem) is always backed by a valid guest
memory slot. However, this assumption is not guaranteed by the KVM
userspace ABI.
A malicious or buggy userspace can set the STA shared memory GPA via
KVM_SET_ONE_REG without establishing a corresponding memory region via
KVM_SET_USER_MEMORY_REGION. In such cases, the GPA cannot be translated
to a valid HVA and kvm_vcpu_gfn_to_hva() returns an error address.
The current implementation incorrectly treats this as a kernel warning
using WARN_ON(), which may escalate to a kernel panic when panic_on_warn
is enabled.
This is not a kernel bug condition but a normal invalid configuration
from userspace, and should be handled gracefully.
Fix it by removing WARN_ON() and treating invalid HVA as a normal
failure case, resetting the STA shared memory state.
Fixes: e9f12b5fff8ad0 ("RISC-V: KVM: Implement SBI STA extension")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Assisted-by: OpenClaw:DeepSeek-V3.2
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Reviewed-by: Andrew Jones <andrew.jones@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260415075216.2757427-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
- Fix x86 boot crash for non-kjump kexecs (David Woodhouse)
* tag 'x86-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kexec: Push kjump return address even for non-kjump kexec
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
- Fix ARM64-specific rseq regressions (Mark Rutland)
* tag 'sched-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm64/entry: Fix arm64-specific rseq brokenness
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MCE fix from Ingo Molnar:
- Fix an MCE polling interval adjustment regression (Borislav Petkov)
* tag 'ras-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Restore MCA polling interval halving
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
"Relatively low-impact fixes. Probably the most notable one is that we
no longer ask the monitor-mode firmware to delegate misaligned access
handling to the kernel by default, since the kernel code needs
significant improvement to match the functionality of the firmware.
This change avoids functional problems at some cost in performance,
but shouldn't affect any system with misaligned access handling in
hardware.
- Disable satp register probing when no5lvl is specified on the
kernel command line
- Fix a CFI-related issue with the misaligned access speed
measurement code
- Reduce the CFI shadow stack size limit from 4GB to 2GB (following
ARM64 GCS)
- Prevent the kernel from requesting delegation of misaligned access
faults unless a new Kconfig option, RISCV_SBI_FWFT_DELEGATE_MISALIGNED,
is enabled. This will depend on CONFIG_NONPORTABLE until the
deficiencies of the kernel misaligned access fixup code are fixed
- Fix some potential uninitialized memory accesses in error paths in
compat_riscv_gpr_set() and compat_restore_sigcontext()
- Fix a bug in the RISC-V MIPS vendor errata patching code where a
logical-and was used in place of a bitwise-and
- Drop some unnecessary code in riscv_fill_hwcap_from_isa_string()
- Use macros for isa2hwcap indices in riscv_fill_hwcap(), rather than
open-coding them
- Fix some documentation typos (one affecting 'make htmldocs')"
* tag 'riscv-for-linus-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: misaligned: Make enabling delegation depend on NONPORTABLE
riscv: Docs: fix unmatched quote warning
riscv: cfi: reduce shadow stack size limit from 4GB to 2GB
riscv: cpufeature: Use pre-defined ISA ext macros to index isa2hwcap
riscv: mm: Fixup no5lvl failure when vaddr is invalid
riscv: Fix register corruption from uninitialized cregs on error
riscv: errata: Fix bitwise vs logical AND in MIPS errata patching
Documentation: riscv: cmodx: fix typos
riscv: cpufeature: Drop this_hwcap clear in T-Head vector workaround
riscv: Define __riscv_copy_{,vec_}{words,bytes}_unaligned() using SYM_TYPED_FUNC_START
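The logical-and versus bitwise-and mix-up mentioned in the pull message is easy to reproduce in isolation. The flag names and functions below are hypothetical and only illustrate the class of bug, not the actual errata patching code.

```c
#include <stdbool.h>

#define EX_ERRATA_A	(1u << 0)
#define EX_ERRATA_B	(1u << 1)

/* Buggy: true whenever both operands are non-zero, even when they
 * share no bits. */
static bool ex_needs_patch_buggy(unsigned int cpu_errata, unsigned int patch_errata)
{
	return cpu_errata && patch_errata;
}

/* Fixed: true only when the CPU's errata mask contains the patch's bit. */
static bool ex_needs_patch_fixed(unsigned int cpu_errata, unsigned int patch_errata)
{
	return (cpu_errata & patch_errata) != 0;
}
```

With cpu_errata = EX_ERRATA_A and patch_errata = EX_ERRATA_B, the buggy form reports a match although the masks do not intersect, while the fixed form does not.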
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Madhavan Srinivasan:
- Fix preempt count leak in sysfs show paths
- Fix error handling in pika_dtm_thread
- Remove pmac_low_i2c_{lock,unlock}()
- Enable all windfarms by default
- Fix dead default for GUEST_STATE_BUFFER_TEST
- Remove redundant preempt_disable|enable() calls from
arch_irq_work_raise()
Thanks to Aboorva Devarajan, Ally Heev, Amit Machhiwal, Bart Van Assche,
Christophe Leroy, Christophe Leroy (CS GROUP), Dan Carpenter, Gautam
Menghani, Harsh Prateek Bora, Julian Braha, Krzysztof Kozlowski, Linus
Walleij, Ma Ke, Ritesh Harjani (IBM), and Sayali Patil
* tag 'powerpc-7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/time: Remove redundant preempt_disable|enable() calls from arch_irq_work_raise()
powerpc/hv-gpci: fix preempt count leak in sysfs show paths
powerpc: fix dead default for GUEST_STATE_BUFFER_TEST
powerpc/powermac: Remove pmac_low_i2c_{lock,unlock}()
powerpc/warp: Fix error handling in pika_dtm_thread
powerpc: 82xx: fix uninitialized pointers with free attribute
powerpc/g5: Enable all windfarms by default
|
|
The GMAC node incorrectly listed four clocks, including a separate tx_clk
and a TSU GCK clock sourced from ID 67. According to the SAM9X7 clocking
scheme, the GMAC uses only three clocks: HCLK, PCLK, and the TSU GCK
derived from the GMAC peripheral clock (ID 24).
Remove the unused tx_clk, update the clock-names accordingly, and correct
the assigned clock to use GCK 24 instead of GCK 67. This aligns the device
tree with the actual hardware clock topology and prevents misconfiguration
of the GMAC clock tree.
[root@SAM9X75 ~]$ cat /sys/kernel/debug/clk/clk_summary | grep gmac
gmac_gclk 1 1 1 266666666 0 0 50000 Y f802c000.ethernet tsu_clk
f802c000.ethernet tsu_clk
gmac_clk 2 2 0 266666666 0 0 50000 Y f802c000.ethernet hclk
f802c000.ethernet pclk
Fixes: 41af45af8bc3 ("ARM: dts: at91: sam9x7: add device tree for SoC")
Signed-off-by: Mihai Sain <mihai.sain@microchip.com>
Link: https://lore.kernel.org/r/20260309075329.1528-5-mihai.sain@microchip.com
[claudiu.beznea: massaged the patch description]
Signed-off-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- one simple cleanup
- a fix for a corner case when running as Xen PV dom0
- a fix of a regression for Xen PV guests, introduced in 7.0
* tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: Tolerate nested XEN_LAZY_MMU entering/leaving
x86/xen: Fix xen_e820_swap_entry_with_ram()
xen/arm: Replace __ASSEMBLY__ with __ASSEMBLER__ in interface.h
|
|
When KASAN is enabled, such as with allmodconfig, the build fails when
building the Rust code with:
error: kernel-address sanitizer is not supported for this target
error: aborting due to 1 previous error
make[4]: *** [rust/Makefile:654: rust/core.o] Error 1
The arm-unknown-linux-gnueabi target does not support KASAN, so avoid
saying Rust is supported when it is enabled.
Cc: stable@vger.kernel.org
Fixes: ccb8ce526807 ("ARM: 9441/1: rust: Enable Rust support for ARMv7")
Link: https://github.com/Rust-for-Linux/linux/issues/1234
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Christian Schrefl <chrisi.schrefl@gmail.com>
Link: https://patch.msgid.link/20260511-arm-avoid-rust-with-kasan-v1-1-24d55f4a900b@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI support fixes from Rafael Wysocki:
"These fix several platform drivers that use the ACPI companion of the
given platform device without checking its presence, which may lead to
a NULL pointer dereference or other kind of malfunction if the driver
is forced to match a device without an ACPI companion via driver
override, and restore debug log level for some messages in the ACPI
CPPC library:
- Check ACPI_COMPANION() against NULL during probe in several core
ACPI device drivers (Rafael Wysocki)
- Restore log level of messages in amd_set_max_freq_ratio() (Mario
Limonciello)"
* tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PAD: xen: Check ACPI_COMPANION() against NULL
ACPI: driver: Check ACPI_COMPANION() against NULL during probe
Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
|
|
Merge a revert of an ACPI CPPC commit that increased the log level of
some debug messages which turned out to be a bad idea:
- Restore log level of messages in amd_set_max_freq_ratio() (Mario
Limonciello)
* acpi-cppc:
Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
|
|
With the support of nested lazy mmu sections it can happen that
arch_enter_lazy_mmu_mode() is being called twice without a call of
arch_leave_lazy_mmu_mode() in between, as the lazy_mmu_*() helpers
are not disabling preemption when checking for nested lazy mmu
sections.
This is a problem when running as a Xen PV guest, as
xen_enter_lazy_mmu() and xen_leave_lazy_mmu() don't tolerate this
case.
Fix that in xen_enter_lazy_mmu() and xen_leave_lazy_mmu() in order
not to hurt all other lazy mmu mode users.
Fixes: 291b3abed657 ("x86/xen: use lazy_mmu_state when context-switching")
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260508143933.493013-1-jgross@suse.com>
|
|
When swapping a non-page-aligned E820 map entry with RAM, the start
address of the modified entry is calculated incorrectly (the offset into
the page is subtracted from the page address instead of being added to it).
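The arithmetic can be sketched as follows. This is an illustration under
assumptions, not the patched Xen code: the 4096-byte page size and the
helper names page_offset(), swapped_entry_start() and
swapped_entry_start_buggy() are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed page size */

/* Offset of an address within its page. */
static uint64_t page_offset(uint64_t addr)
{
	return addr & (SKETCH_PAGE_SIZE - 1);
}

/*
 * Start address of an entry after its backing page has been moved to
 * new_page_addr: the in-page offset of the old start is preserved, so
 * it is added to the new page address.
 */
static uint64_t swapped_entry_start(uint64_t old_start, uint64_t new_page_addr)
{
	return new_page_addr + page_offset(old_start);
}

/* The kind of mistake the message describes, kept for comparison. */
static uint64_t swapped_entry_start_buggy(uint64_t old_start,
					  uint64_t new_page_addr)
{
	return new_page_addr - page_offset(old_start);
}
```

For a page-aligned entry the offset is zero and both variants agree,
which is why only unaligned entries expose the difference.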
Fixes: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory")
Reported-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260505102417.208138-1-jgross@suse.com>
|
|
arch_irq_work_raise()
A kernel panic is observed when handling machine check exceptions from
real mode.
BUG: Unable to handle kernel data access on read at 0xc00000006be21300
Oops: Kernel access of bad area, sig: 11 [#1]
MSR: 8000000000001003 <SF,ME,RI,LE> CR: 88222248 XER: 00000005
CFAR: c00000000003ffc4 DAR: c00000006be21300 DSISR: 40000000 IRQMASK: 0
NIP [c000000000029e40] arch_irq_work_raise+0x10/0x70
LR [c00000000003ffc8] machine_check_queue_event+0xa8/0x150
Call Trace:
[c0000000179d3c70] [c00000000003ff64] machine_check_queue_event+0x44/0x150
[c0000000179d3d30] [c0000000000084e0] machine_check_early_common+0x1f0/0x2c0
The crash occurs because arch_irq_work_raise() calls preempt_disable()
from machine check exception (MCE) handlers running in real mode. In
this context, accessing the preempt_count can fault, leading to the panic.
The preempt_disable()/preempt_enable() pair in arch_irq_work_raise()
was originally added by commit 0fe1ac48bef0 ("powerpc/perf_event: Fix
oops due to perf_event_do_pending call") to avoid races while raising
irq work from exception context.
Later, commit 471ba0e686cb ("irq_work: Do not raise an IPI when
queueing work on the local CPU") added preemption protection in
irq_work_queue() path, while commit 20b876918c06 ("irq_work: Use per
cpu atomics instead of regular atomics") added equivalent
protection in irq_work_queue_on() before reaching arch_irq_work_raise():
irq_work_queue() / irq_work_queue_on()
-> preempt_disable()
-> __irq_work_queue_local()
-> irq_work_raise()
-> arch_irq_work_raise()
As a result, callers other than mce_irq_work_raise() already execute
with preemption disabled, making the additional
preempt_disable()/preempt_enable() pair in arch_irq_work_raise()
redundant.
The arch_irq_work_raise() function executes in NMI context when called
from the MCE handler. Hence we will not be preempted or scheduled out,
since we are in NMI context with MSR[EE]=0. Therefore, it is safe to
remove the preempt_disable()/preempt_enable() calls from here.
Remove them to avoid accessing preempt_count from real mode context.
Fixes: cc15ff327569 ("powerpc/mce: Avoid using irq_work_queue() in realmode")
Suggested-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Sayali Patil <sayalip@linux.ibm.com>
[Maddy: Fixed the commit title]
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260513081413.222490-1-sayalip@linux.ibm.com
|
|
The unaligned access emulation code in Linux has various deficiencies.
For example, it doesn't emulate vector instructions [1] [2], and doesn't
emulate KVM guest accesses. Therefore, requesting misaligned exception
delegation with SBI FWFT actually regresses vector instructions' and KVM
guests' behavior.
Until Linux can handle it properly, guard these sbi_fwft_set() calls
behind RISCV_SBI_FWFT_DELEGATE_MISALIGNED, which in turn depends on
NONPORTABLE. Those who are sure that this wouldn't be a problem can
enable this option, perhaps getting better performance.
The rest of the existing code proceeds as before, except as if
SBI_FWFT_MISALIGNED_EXC_DELEG is not available, to handle any remaining
address misaligned exceptions on a best-effort basis. The KVM SBI FWFT
implementation is also not touched, but it is disabled if the firmware
emulates unaligned accesses.
Cc: stable@vger.kernel.org
Fixes: cf5a8abc6560 ("riscv: misaligned: request misaligned exception from SBI")
Reported-by: Songsong Zhang <U2FsdGVkX1@gmail.com> # KVM
Link: https://lore.kernel.org/linux-riscv/38ce44c1-08cf-4e3f-8ade-20da224f529c@iscas.ac.cn/ [1]
Link: https://lore.kernel.org/linux-riscv/b3cfcdac-0337-4db0-a611-258f2868855f@iscas.ac.cn/ [2]
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260401-riscv-misaligned-dont-delegate-v2-1-5014a288c097@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>
|
|
Follow the ARM64 GCS (Guarded Control Stack) implementation approach
by reducing the shadow stack size allocation from min(RLIMIT_STACK, 4GB)
to min(RLIMIT_STACK/2, 2GB). See commit 506496bcbb42 ("arm64/gcs: Ensure
that new threads have a GCS").
Rationale:
1. Shadow stacks only store return addresses (8 bytes per entry), not
local variables, function parameters, or saved registers. A 2GB
shadow stack is far more than sufficient for any practical
application, even with extremely deep recursion. Using half the size
maintains adequate margin while being more resource-efficient.
2. On memory-constrained systems (e.g., platforms with only 4GB of
physical memory, which is a common configuration), allocating 4GB
of virtual address space for shadow stack per process/thread can
lead to virtual memory allocation failures when the overcommit mode
is set to OVERCOMMIT_GUESS or OVERCOMMIT_NEVER:
Error: "__vm_enough_memory: not enough memory for the allocation"
This reduces virtual address space consumption by 50% while maintaining
more than adequate space for return address storage.
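The two sizing rules can be compared with a short sketch. The function
names shstk_size_old() and shstk_size_new() are hypothetical, and any
page alignment the kernel may apply to the result is left out.

```c
#include <stdint.h>

#define SKETCH_SZ_2G (2ULL << 30)
#define SKETCH_SZ_4G (4ULL << 30)

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Previous rule: min(RLIMIT_STACK, 4GB). */
static uint64_t shstk_size_old(uint64_t rlimit_stack)
{
	return min_u64(rlimit_stack, SKETCH_SZ_4G);
}

/* New rule: min(RLIMIT_STACK/2, 2GB). */
static uint64_t shstk_size_new(uint64_t rlimit_stack)
{
	return min_u64(rlimit_stack / 2, SKETCH_SZ_2G);
}
```

With an unlimited stack rlimit the old rule asks for 4GB of address
space per shadow stack and the new one for 2GB; with an 8MB rlimit the
sizes are 8MB and 4MB respectively.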
Signed-off-by: Zong Li <zong.li@sifive.com>
Link: https://patch.msgid.link/20260428024105.645162-1-zong.li@sifive.com
[pjw@kernel.org: clean up patch description]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
|
|
init_on_free
__GFP_ZEROTAGS semantics are currently a bit weird, but effectively this
flag is only ever set alongside __GFP_ZERO and __GFP_SKIP_KASAN.
If we run with init_on_free, we will zero out pages during
__free_pages_prepare(), to skip zeroing on the allocation path.
However, when allocating with __GFP_ZEROTAGS set, post_alloc_hook() will
consequently not only skip clearing page content, but also skip clearing
tag memory.
Not clearing tags through __GFP_ZEROTAGS is irrelevant for most pages that
will get mapped to user space through set_pte_at() later: set_pte_at() and
friends will detect that the tags have not been initialized yet
(PG_mte_tagged not set), and initialize them.
However, for the huge zero folio, which will be mapped through a PMD
marked as special, this initialization will not be performed, ending up
exposing whatever tags were still set for the pages.
The docs (Documentation/arch/arm64/memory-tagging-extension.rst) state
that allocation tags are set to 0 when a page is first mapped to user
space. That no longer holds with the huge zero folio when init_on_free is
enabled.
Fix it by decoupling __GFP_ZEROTAGS from __GFP_ZERO, passing to
tag_clear_highpages() whether we want to also clear page content.
Invert the meaning of the tag_clear_highpages() return value to have
clearer semantics.
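A toy model of the decoupling, written for userspace and heavily
simplified. Nothing here is the kernel implementation: struct toy_page,
toy_tag_clear() and toy_post_alloc() are hypothetical, a single byte
stands in for a page's allocation tags, and the flags are plain
booleans.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_BYTES 64 /* toy page size */

struct toy_page {
	unsigned char data[SKETCH_PAGE_BYTES];
	unsigned char tag; /* stands in for the page's allocation tags */
};

/*
 * Hypothetical analogue of a tag-clearing helper that is told whether
 * it should also clear the page content. Tags are always cleared.
 */
static void toy_tag_clear(struct toy_page *p, bool clear_content)
{
	p->tag = 0;
	if (clear_content)
		memset(p->data, 0, sizeof(p->data));
}

/*
 * Toy allocation hook: content zeroing is skipped when the page was
 * already zeroed on free, but that decision no longer affects whether
 * the tags are cleared.
 */
static void toy_post_alloc(struct toy_page *p, bool want_zero,
			   bool want_zerotags, bool zeroed_on_free)
{
	bool clear_content = want_zero && !zeroed_on_free;

	if (want_zerotags)
		toy_tag_clear(p, clear_content);
	else if (clear_content)
		memset(p->data, 0, sizeof(p->data));
}
```

In this model a page freed with a stale non-zero tag comes back with
tag 0 when tag zeroing is requested, even if its content was already
zeroed at free time.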
Reproduced with the huge zero folio by modifying the check_buffer_fill
arm64/mte selftest to use a 2 MiB area, after making sure that pages have
a non-0 tag set when freeing (note that, during boot, we will not actually
initialize tags, but only set KASAN_TAG_KERNEL in the page flags).
$ ./check_buffer_fill
1..20
...
not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
not ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
...
This code needs more cleanups; we'll tackle that next, like
decoupling __GFP_ZEROTAGS from __GFP_SKIP_KASAN.
[akpm@linux-foundation.org: s/__GPF_ZERO/__GFP_ZERO/, per David]
Link: https://lore.kernel.org/20260421-zerotags-v2-1-05cb1035482e@kernel.org
Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Add the pKVM side of the workaround for ARM's erratum 4193714,
provided that the EL3 firmware does its part of the job. KVM will
refuse to initialise otherwise
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0
- Correctly deal with permission faults in guest_memfd memslots
- Fix the steal-time selftest after the infrastructure was reworked
- Make sure the host cannot pass a nonsensical clock update to the
EL2 tracing infrastructure
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events
- Handle the current vcpu being NULL on EL2 panic
- Fix the selftest_vcpu memcache being empty at the point of donation
or sharing
- Check that the memcache has enough capacity before engaging on the
share/donate path
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context
s390:
- Fix array overrun with large amounts of PCI devices
x86:
- Never use L0's PAUSE loop exiting while L2 is running, since it's
unlikely that a nested guest will help solve the hypervisor's
spinlock contention
- Fix emulation of MOVNTDQA
- Fix typo in Xen hypercall tracepoint
- Add back an optimization that was left behind when recently fixing
a bug
- Add module parameter to disable CET, whose implementation seems to
have issues. For now it remains enabled by default
Generic:
- Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()
Documentation:
- Update stale links
Selftests:
- Fix guest_memfd_test with host page size > guest page size"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
KVM: VMX: introduce module parameter to disable CET
KVM: x86: Swap the dst and src operand for MOVNTDQA
KVM: x86: use again the flush argument of __link_shadow_page()
KVM: selftests: Ensure gmem file sizes are multiple of host page size
Documentation: kvm: update links in the references section of AMD Memory Encryption
KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
KVM: x86: Fix Xen hypercall tracepoint argument assignment
KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
KVM: arm64: Make EL2 exception entry and exit context-synchronization events
MAINTAINERS: Add Steffen as reviewer for KVM/arm64
KVM: arm64: Remove potential UB on nvhe tracing clock update
KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
KVM: arm64: Handle permission faults with guest_memfd
KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
...
|
|
commit 446fcce2a52b ("Revert "x86: kvm: rate-limit global clock updates"")
dropped the rate limiting for KVM_REQ_GLOBAL_CLOCK_UPDATE.
As a result, kvm_arch_vcpu_load() can queue global clock update requests
every time a vCPU is scheduled when the master clock is disabled or when
the vCPU is loaded for the first time.
Restore the throttling with a per-VM ratelimit state and gate
KVM_REQ_GLOBAL_CLOCK_UPDATE through __ratelimit(), so frequent vCPU
scheduling does not generate a steady stream of redundant clock update
requests.
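The throttling idea can be modelled in userspace as below. This is a
sketch, not the KVM patch: struct toy_ratelimit only imitates the
interval/burst behaviour of the kernel's ratelimit state, time is passed
in explicitly, and struct toy_vm and toy_vcpu_load() are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified interval/burst limiter with caller-supplied time. */
struct toy_ratelimit {
	uint64_t interval;  /* window length, arbitrary time units */
	unsigned int burst; /* events allowed per window */
	uint64_t begin;     /* start of the current window */
	unsigned int count; /* events allowed so far in this window */
	bool started;
};

/* Returns true when the event may proceed, false when it is throttled. */
static bool toy_ratelimit_allow(struct toy_ratelimit *rs, uint64_t now)
{
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->count = 0;
		rs->started = true;
	}
	if (rs->count < rs->burst) {
		rs->count++;
		return true;
	}
	return false;
}

/* One limiter per VM, shared by all of its vCPUs. */
struct toy_vm {
	struct toy_ratelimit clock_update_rs;
	unsigned int requests; /* global clock update requests queued */
};

/* Models the vCPU load path: the request is gated by the limiter. */
static void toy_vcpu_load(struct toy_vm *vm, uint64_t now)
{
	if (toy_ratelimit_allow(&vm->clock_update_rs, now))
		vm->requests++;
}
```

With an interval of 100 and a burst of 1, three loads at times 0, 1 and
2 queue a single request, and the next one is queued only once the
window has expired.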
Fixes: 446fcce2a52b ("Revert "x86: kvm: rate-limit global clock updates"")
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Closes: https://lore.kernel.org/all/CAK8fFZ5gY8_Mw2A=iZVFNVKQNrXQzVsn-HTd+Me9K6ZfmdgA+Q@mail.gmail.com/
Link: https://patch.msgid.link/20260409142226.2581-1-lei.chen@smartx.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
x86_virt_invoke_kvm_emergency_callback() reaches rcu_dereference()
through machine_crash_shutdown() with IRQs disabled but with RCU not
necessarily watching the crashing CPU, which triggers a suspicious
RCU usage splat on debug kernels (CONFIG_PROVE_RCU=y) during
panic/kdump:
WARNING: suspicious RCU usage
arch/x86/virt/hw.c:52 suspicious rcu_dereference_check() usage!
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by tee/11119:
#0: ffff8881fa32c440 (sb_writers#3){.+.+}-{0:0}, at: ksys_write
Call Trace:
<TASK>
dump_stack_lvl+0x84/0xd0
lockdep_rcu_suspicious.cold+0x37/0x8f
x86_virt_invoke_kvm_emergency_callback+0x5f/0x70
x86_svm_emergency_disable_virtualization_cpu+0x2a/0x30
x86_virt_emergency_disable_virtualization_cpu+0x6b/0x90
native_machine_crash_shutdown+0x72/0x170
__crash_kexec+0x137/0x280
panic+0xce/0xd0
sysrq_handle_crash+0x1f/0x20
__handle_sysrq.cold+0x192/0x335
write_sysrq_trigger+0x8c/0xc0
proc_reg_write+0x1c3/0x3c0
vfs_write+0x1d0/0xf80
ksys_write+0x116/0x250
do_syscall_64+0x11c/0x1480
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
A truly correct fix is non-trivial: the RCU usage genuinely is wrong in
panic context (RCU may ignore the crashing CPU during synchronization),
and a concurrent KVM module unload could in principle race with the
callback read; see commit 2baa33a8ddd6 ("KVM: x86: Leave user-return
notifier registered on reboot/shutdown") which notes that nothing
prevents module unload during panic/reboot.
However, the alternatives are worse:
- smp_store_release()/smp_load_acquire() handles ordering but not
liveness; the kernel still needs to keep the module text alive
while the callback is in flight.
- Taking a lock in the panic path is risky — any lock could be held
by a CPU that has already been NMI'd to a halt.
Use rcu_dereference_raw() to silence the splat and accept the
vanishingly small remaining race. Panic context inherently cannot
guarantee complete correctness; the goal here is to keep debug builds
quiet on the kdump path so the splat doesn't obscure the actual
kernel state being captured.
Reproducible on a debug kernel (CONFIG_PROVE_LOCKING=y, CONFIG_PROVE_RCU=y)
with kvm_amd or kvm_intel loaded by triggering kdump:
echo c > /proc/sysrq-trigger
Suggested-by: Sean Christopherson <seanjc@google.com>
Fixes: 428afac5a8ea ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem")
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260504235435.90957-1-mikhail.v.gavrilov@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
RongQing reported that the MCA polling interval doesn't halve when an
error gets logged. It was traced down to the commit in Fixes:, because:
mce_timer_fn()
|-> mce_poll_banks()
|-> machine_check_poll()
|-> mce_log()
which will queue the work and return.
Now, back in mce_timer_fn():
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
*/
if (mce_notify_irq())
<--- here we haven't run the notifier chain yet so mce_need_notify is
not set yet so this won't hit and we won't halve the interval iv.
Now the notifier chain runs. mce_early_notifier() sets the bit, does
mce_notify_irq(), that clears the bit and then the notifier chain
a little later logs the error.
So this is a silly timing issue.
But, that's all unnecessary.
All that needs to happen here is that the question mce_notify_irq()
asks, "should we notify of a logged MCE?", becomes a simple question to
the mce gen pool:
"Are you empty?"
And that then turns into a simple yes or no answer and it all
JustWorks(tm).
So do that and also distribute the functionality where it belongs:
- Print that MCE events have been logged in mce_log()
- Trigger the mcelog tool specific work in the first notifier
As a result, mce_notify_irq() can go now.
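The resulting interval adjustment can be sketched as a pure function of
the gen pool state. This is a userspace illustration with assumptions:
the bounds SKETCH_IV_MIN and SKETCH_IV_MAX are made up, doubling on the
"increase" side is assumed, and next_poll_interval() is a hypothetical
name.

```c
#include <stdbool.h>

#define SKETCH_IV_MIN 1UL    /* assumed lower bound, in ticks */
#define SKETCH_IV_MAX 1024UL /* assumed upper bound, in ticks */

/*
 * Pick the next polling interval: halve it when errors are waiting in
 * the pool, otherwise back off by doubling, clamped to the bounds.
 */
static unsigned long next_poll_interval(unsigned long iv, bool pool_empty)
{
	if (!pool_empty) {
		iv /= 2;
		if (iv < SKETCH_IV_MIN)
			iv = SKETCH_IV_MIN;
	} else {
		iv *= 2;
		if (iv > SKETCH_IV_MAX)
			iv = SKETCH_IV_MAX;
	}
	return iv;
}
```

Because the decision depends only on whether the pool is empty, it no
longer matters whether the notifier chain has already run when the timer
function makes it.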
Fixes: 011d82611172 ("RAS: Add a Corrected Errors Collector")
Reported-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20260112082747.2842-1-lirongqing@baidu.com
|
|
There have been reports of host hangs caused by CET virtualization.
Until these are analyzed further, introduce a module parameter that
makes it possible to easily disable it.
Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/
Cc: David Riley <d.riley@proxmox.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: pci: fix array indexing
For large numbers of PCI devices it's possible to overrun the arrays as
the index was miscalculated in 2 places.
|
|
Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same
characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores
to registers, while MOVNTDQ loads from registers and stores to memory.
Per the SDM:
MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal
hint.
MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint
if WC memory type.
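The operand direction can be illustrated with a toy model. This is not
the KVM emulator code: struct xmm128, toy_movntdq() and toy_movntdqa()
are hypothetical, and the non-temporal hint itself is not modelled.

```c
#include <stdint.h>
#include <string.h>

/* Toy 128-bit register. */
struct xmm128 {
	uint8_t b[16];
};

/* MOVNTDQ m128, xmm1: the register is the source, memory the destination. */
static void toy_movntdq(void *mem, const struct xmm128 *reg)
{
	memcpy(mem, reg->b, sizeof(reg->b));
}

/* MOVNTDQA xmm1, m128: memory is the source, the register the destination. */
static void toy_movntdqa(struct xmm128 *reg, const void *mem)
{
	memcpy(reg->b, mem, sizeof(reg->b));
}
```

Treating MOVNTDQA like MOVNTDQ would write the register contents to
memory instead of loading from it, which is the direction error the
operand swap corrects.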
Reported-by: Josh Eads <josheads@google.com>
Fixes: c57d9bafbd0b ("KVM: x86: Add support for emulating MOVNTDQA")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260506213514.2781948-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Except in the case of parentless nested-TDP pages, mmu_page_zap_pte()
clears the SPTE but leaves the invalid_list empty. In this case, using
kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill.
Avoid flushing the entirety of the remote TLBs unless the invalid_list
was populated: instead, use a more efficient gfn-targeting flush (if
available) and skip it altogether if the caller guarantees that a TLB
flush is not necessary.
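The decision described above can be sketched as a small selection
function. The enum, the function name toy_pick_flush() and its boolean
inputs are hypothetical, and falling back to a full flush when no
gfn-targeted flush is available is an assumption.

```c
#include <stdbool.h>

enum toy_flush {
	TOY_FLUSH_NONE,      /* caller guarantees no flush is needed */
	TOY_FLUSH_GFN_RANGE, /* targeted flush of the affected gfn range */
	TOY_FLUSH_ALL,       /* flush all remote TLBs */
};

/*
 * Choose the cheapest sufficient flush: a full flush only when pages
 * were queued on the invalid list (or as an assumed fallback),
 * otherwise a gfn-targeted flush, or nothing at all.
 */
static enum toy_flush toy_pick_flush(bool invalid_list_populated,
				     bool caller_needs_flush,
				     bool range_flush_available)
{
	if (invalid_list_populated)
		return TOY_FLUSH_ALL;
	if (!caller_needs_flush)
		return TOY_FLUSH_NONE;
	return range_flush_available ? TOY_FLUSH_GFN_RANGE : TOY_FLUSH_ALL;
}
```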
Based-on: <20260503201029.106481-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260503210917.121840-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
its pins
i2c20 is used by the battmgr service on the ADSP to communicate with the
SBS interface of the battery. Initializing it from Linux would break the
battmgr functionality when booted in EL2. Mark those pins as reserved.
Fixes: e7733b42111c ("arm64: dts: qcom: Add support for Dell Inspiron 7441 / Latitude 7455")
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Abel Vesa <abel.vesa@oss.qualcomm.com>
Signed-off-by: Val Packett <val@packett.cool>
Link: https://lore.kernel.org/r/20260312005731.12488-2-val@packett.cool
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.1, take #2
- Add the pKVM side of the workaround for ARM's erratum 4193714, provided
that the EL3 firmware does its part of the job. KVM will refuse to
initialise otherwise.
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0.
- Correctly deal with permission faults in guest_memfd memslots.
- Fix the steal-time selftest after the infrastructure was reworked.
- Make sure the host cannot pass a nonsensical clock update to the
EL2 tracing infrastructure.
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390.
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events.
- Handle the current vcpu being NULL on EL2 panic.
- Fix the selftest_vcpu memcache being empty at the point of donation or
sharing.
- Check that the memcache has enough capacity before engaging on the
share/donate path.
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context.
|