| Age | Commit message (Collapse) | Author |
|
As per the GHCB spec, when using GHCB v2+ require the software scratch area
to reside in the GHCB's shared buffer. Note, things like Page State Change
(PSC) requests _rely_ on this behavior, as the guest can't provide a length
when making the request, i.e. the size of the guest payload is bounded by
the size of the shared buffer.
Failure to force usage of the GHCB, and a slew of other flaws, lets a
malicious SNP guest corrupt host kernel heap memory, and leak host heap
layout information.
setup_vmgexit_scratch() allocates a buffer via kvzalloc(exit_info_2),
where exit_info_2 is guest-controlled. With exit_info_2=24, this yields
a 24-byte allocation in kmalloc-cg-32 (32-byte slab objects). The buffer
holds an 8-byte psc_hdr followed by 8-byte psc_entry structs, so only
entries[0] and entries[1] are in-bounds.
snp_begin_psc() validates end_entry against VMGEXIT_PSC_MAX_COUNT (253)
but NOT against the actual buffer size:
idx_end = hdr->end_entry;
if (idx_end >= VMGEXIT_PSC_MAX_COUNT) { // checks 253, not buffer
snp_complete_psc(svm, ...);
return 1;
}
for (idx = idx_start; idx <= idx_end; idx++) {
entry_start = entries[idx]; // OOB when idx >= 2
The guest sets end_entry=10+, causing the host to iterate entries[2+]
which are OOB into adjacent slab objects. For each OOB entry:
- The host reads 8 bytes (OOB READ / info leak oracle)
- If the data passes PSC validation, __snp_complete_one_psc() writes
cur_page = 1 or 512 into the entry (OOB WRITE, sev.c:3806)
- If validation fails, the error response reveals whether adjacent
memory is zero vs non-zero (information disclosure to guest)
The guest controls allocation size (exit_info_2), entry range
(cur_entry/end_entry), and can fire unlimited VMGEXITs to repeatedly
hit different slab positions.
By exploiting the variety of bugs, a malicious SEV-SNP guest can:
- OOB read adjacent kmalloc-cg-32 objects (heap layout disclosure)
- OOB write cur_page bits into adjacent objects (heap corruption)
- Trigger use-after-free conditions across VMGEXITs
E.g. with KASAN enabled, a single insmod of the PoC guest module
produces 73 KASAN reports:
BUG: KASAN: slab-out-of-bounds in snp_begin_psc+0x126/0x890
Read of size 8 at addr ffff888219ffb5e0 by task qemu-system-x86/2199
BUG: KASAN: slab-out-of-bounds in snp_begin_psc+0x468/0x890
Write of size 8 at addr ffff888351566648 by task qemu-system-x86/2199
The buggy address belongs to the object at ffff888XXXXXXXXX
which belongs to the cache kmalloc-cg-32 of size 32
The buggy address is located N bytes to the right of
allocated 32-byte region [ffff888XXXXXXXXX, ffff888XXXXXXXXX)
Breakdown:
62 slab-out-of-bounds (reads + writes past allocation)
7 slab-use-after-free
4 use-after-free
All credit to Stan for the wonderful description and reproducer!
Reported-by: Stan Shaw <shawstan96@gmail.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Peter Gonda <pgonda@google.com>
Cc: Jacky Li <jackyli@google.com>
Fixes: 4af663c2f64a ("KVM: SEV: Allow per-guest configuration of GHCB protocol version")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Roth <michael.roth@amd.com>
[sean: write changelog]
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.1, take #4
- Restore CONFIG_PKVM_DISABLE_STAGE2_ON_PANIC to its former glory by
making sure the config symbol is correctly spelled out in the code
- Don't reset the AArch32 view of the PMU counters to zero when the
guest is writing to them
- Fix an assorted collection of memory leaks in the newly added tracing
code
- Fix the capping of ZCR_EL2 which could be used in an unsanitised way
by an L2 guest
|
|
KVM x86 fixes for 7.1-rcN
- Include the kernel's linux/mman.h in KVM selftests to ensure MADV_COLLAPSE
is defined, as older libc versions may not provide it.
- Include execinfo.h if and only if KVM selftests are building against glibc,
and provide a test_dump_stack() for non-glibc builds.
- Fudge around an RCU splat in the emegerncy reboot code that is technically
a legitimate flaw, but in practice is a non-issue and fixing the flaw, e.g.
by adding locking, would incur meaningful risk, i.e. do more harm than good.
- Rate-limit global clock updates once again (but without delayed work), as
KVM was subtly relying on the old rate-limiting for NPT correction to guard
against "update storms" when running without a master clock on systems with
overcommitted CPUs.
- Fix a brown paper bag goof where KVM checked if ERAPS is "dirty" instead of
marking it dirty when emulating INVPCID.
- Flush the TLB when transitioning from xAVIC => x2AVIC to ensure the CPU TLB
doesn't contain AVIC-tagged entries for the APIC base GPA.
|
|
ZCR_EL2 can be updated by a VHE guest hypervisor either using ZCR_EL2
(which traps) or ZCR_EL1 (which does not trap). KVM handles both in
different way:
- on ZCR_EL2 trap, ZCR_EL2.LEN is immediately capped at the VM's own
VL limit. This has the potential to break existing SW that relies
on the full LEN field to be stateful.
- on ZCR_EL1 access, we do absolutely nothing.
On restoring the SVE context for an L2 guest, we directly restore the
guest hypervisor's view of ZCR_EL2 into the physical ZCR_EL2. If the
guest's view of the register was updated using the ZCR_EL2 accessor,
the value has already been sanitised (with the caveat mentioned above).
But if the guest used ZCR_EL1, the raw value is written into the HW,
and the L2 guest can now access VLs that it shouldn't.
Fix all the above by moving the VL capping to the restore points,
ensuring that:
- the HW is always programmed with a capped value, irrespective of
the accessor being used,
- the ZCR_EL2.LEN field is always completely stateful, irrespective
of the accessor being used.
Additionally, move ZCR_EL2 to be a sanitised register, ensuring that
only the LEN field is actually stateful. This requires some creative
construction of the RES0 mask, as the sysreg generation script does
not yet generate RAZ/WI fields.
Fixes: b3d29a823099 ("KVM: arm64: nv: Handle ZCR_EL2 traps")
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260529-kvm-arm64-fix-zcr-len-nv-v2-1-86cad51992bd@kernel.org
[maz: rewrote commit message, tidy up access_zcr_el2()]
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The aes driver registers both skcipher and aead algorithms,
but when aead is not enabled this causes a link failure:
s390-linux-ld: arch/s390/crypto/aes_s390.o: in function `aes_s390_fini':
arch/s390/crypto/aes_s390.c:969:(.text+0x115e): undefined reference to `crypto_unregister_aead'
s390-linux-ld: arch/s390/crypto/aes_s390.o: in function `aes_s390_init':
arch/s390/crypto/aes_s390.c:1028:(.init.text+0x294): undefined reference to `crypto_register_aead'
Add the missing 'select' statement.
Fixes: bf7fa038707c ("s390/crypto: add s390 platform specific aes gcm support.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
These FIS partition offsets were never right: the comment clearly
states the FIS index is at 0xfe0000 and 0x7f * 0x200000 is
0xfe0000.
Tested on the iTian SQ201.
Fixes: d88b11ef91b1 ("ARM: dts: Fix up SQ201 flash access")
Fixes: b5a923f8c739 ("ARM: dts: gemini: Switch to redboot partition parsing")
Signed-off-by: Linus Walleij <linusw@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/fixes
Qualcomm Arm64 defconfig fixes for v7.1
A number of targets now depends on the M.2 PCIe power sequencing driver,
enable this to keep these devices functional with a defconfig build.
* tag 'qcom-arm64-defconfig-fixes-for-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
arm64: defconfig: Enable PCI M.2 power sequencing driver
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/fixes
Qualcomm Arm64 DeviceTree fixes for v7.1
Add missing power-domain and iface clocks for the ICE node of Eliza and
Milos to avoid the validation errors that resulted from late binding
changes. Also drop the reference clock for the USB QMP PHYs, for the
same reason.
Avoid touching the 20'th I2C bus on the Hamoa-based (X Elite) Dell
laptops, as this conflicts with the battery management firmware.
* tag 'qcom-arm64-fixes-for-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
arm64: dts: qcom: eliza: Add power-domain and iface clk for ice node
arm64: dts: qcom: milos: Add power-domain and iface clk for ice node
arm64: dts: qcom: x1-dell-thena: remove i2c20 (battery SMBus) and reserve its pins
arm64: dts: qcom: glymur: Drop RPMh CXO clocks from QMP PHYs
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Clang recently added support for -Wattribute-alias [1], which results in
the same warnings that necessitated commit bee20031772a ("disable
-Wattribute-alias warning for SYSCALL_DEFINEx()") for GCC.
kernel/time/itimer.c:325:1: error: alias and aliasee have different types 'long (unsigned int)' and 'long (typeof (__builtin_choose_expr((__builtin_types_compatible_p(typeof ((unsigned int)0), typeof (0LL)) || __builtin_types_compatible_p(typeof ((unsigned int)0), typeof (0ULL))), 0LL, 0L)))' (aka 'long (long)') [-Werror,-Wattribute-alias]
325 | SYSCALL_DEFINE1(alarm, unsigned int, seconds)
| ^
include/linux/syscalls.h:225:36: note: expanded from macro 'SYSCALL_DEFINE1'
225 | #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
| ^
include/linux/syscalls.h:236:2: note: expanded from macro 'SYSCALL_DEFINEx'
236 | __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
| ^
include/linux/syscalls.h:251:18: note: expanded from macro '__SYSCALL_DEFINEx'
251 | __attribute__((alias(__stringify(__se_sys##name)))); \
| ^
kernel/time/itimer.c:325:1: note: aliasee is declared here
include/linux/syscalls.h:225:36: note: expanded from macro 'SYSCALL_DEFINE1'
225 | #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
| ^
include/linux/syscalls.h:236:2: note: expanded from macro 'SYSCALL_DEFINEx'
236 | __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
| ^
include/linux/syscalls.h:255:18: note: expanded from macro '__SYSCALL_DEFINEx'
255 | asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
| ^
<scratch space>:16:1: note: expanded from here
16 | __se_sys_alarm
| ^
Disable the warnings in the same way for clang-23 and newer. Disable the
warning about unknown warning options to avoid breaking the build for
versions of clang-23 that do not have -Wattribute-alias, such as ones
deployed by vendors like Android or CI systems or when bisecting LLVM
between llvmorg-23-init and release/23.x.
Cc: stable@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/2163
Link: https://github.com/llvm/llvm-project/commit/40da6920a0d71d49dfa2392b09153600b0759f5e [1]
Link: https://patch.msgid.link/20260515-syscall-disable-attribute-alias-for-clang-v1-1-9a9d95d41df6@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
|
|
When CONFIG_DEBUG_BUGVERBOSE is disabled, the s390 __BUG_ENTRY() macro
omits the format string pointer, so the generated __bug_table entry no
longer matches struct bug_entry.
With HAVE_ARCH_BUG_FORMAT enabled, the generic BUG infrastructure reads
bug_entry::format via bug_get_format(). If the format word is missing,
subsequent fields are read from the wrong offset, which may:
- Misinterpret flags (BUG vs WARN classification errors)
- Fault when dereferencing a misread format pointer
The root cause is that __BUG_ENTRY() delegates format word emission to
__BUG_ENTRY_VERBOSE(), which is conditional on CONFIG_DEBUG_BUGVERBOSE.
Fix this by moving the format field emission directly into __BUG_ENTRY()
so it is always emitted unconditionally. Remove the format parameter from
__BUG_ENTRY_VERBOSE() and keep only file/line emission conditional on
CONFIG_DEBUG_BUGVERBOSE.
Fixes: 2b71b8ab9718 ("s390/bug: Use BUG_FORMAT for DEBUG_BUGVERBOSE_DETAILED")
Signed-off-by: Jan Polensky <japo@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
It was missed that idt_do_interrupt_irqoff() gets compiled on x84_64;
this is a problem for CFI builds because it includes an unadorned
indirect call. It is however completely dead code.
Rework things to not emit this function at all.
Fixes: 0701c9e17bd9 ("x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260526090631.GA4149641@noisy.programming.kicks-ass.net
|
|
With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected platform
(eg: Skylake), with retbleed=stuff, registering a dynamic ftrace trampoline
crashes on the first call into the traced function:
BUG: unable to handle page fault for address: ffff88817ae18880
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 4b53067 P4D 4b53067 PUD 0
Oops: Oops: 0002 [#1] SMP PTI
CPU: 3 UID: 0 PID: 187 Comm: usleep Not tainted 7.0.10 #243 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
Code: 24 78 00 00 00 00 48 89 ea 48 89 54 24 20 48 8b b4 24 b8 00 00 00 48 8b bc 24 b0 00 00 00 48 89 bc 24 80 00 00 00 48 83 ef 05 <65> 48 c1 3d 1f a8 b6 02 05 48 8b 15 f6 00 00 00 4c 89 3c 24 4c 89
Call Trace:
<TASK>
? find_held_lock
? exc_page_fault
? lock_release
? __x64_sys_clock_nanosleep
? lockdep_hardirqs_on_prepare
? trace_hardirqs_on
__x64_sys_clock_nanosleep
do_syscall_64
? exc_page_fault
? call_depth_return_thunk
entry_SYSCALL_64_after_hwframe
...
Kernel panic - not syncing: Fatal exception
This small reproducer allows to easily trigger the crash:
# echo 'p __x64_sys_clock_nanosleep' > /sys/kernel/tracing/kprobe_events
# echo 1 > /sys/kernel/tracing/events/kprobes/p___x64_sys_clock_nanosleep_0/enable
# usleep 1
Monitoring the crash under GDB points to the exact instruction in charge of
incrementing the call depth:
sarq $5, %gs:__x86_call_depth(%rip)
This instruction matches the one inserted by the ftrace_regs_caller from
ftrace_64.S. This emitted code was likely working fine until the introduction
of
59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()"):
it has made the call depth accounting addressing relative to $rip, instead of
being based on an absolute address.
As this code exact location depends on where the trampoline lives in memory,
the corresponding displacement needs to be adjusted at runtime to actually
correctly find the per-cpu __x86_call_depth value, otherwise the targeted
address is wrong, leading to the page fault seen above.
Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT
instruction (from ftrace_regs_caller) by calling text_poke_apply_relocation(),
as it is done for example by the x86 BPF JIT compiler through
x86_call_depth_emit_accounting(). This corrects both CALL_DEPTH_ACCOUNT slots,
in ftrace_caller and ftrace_regs_caller.
[ bp: Massage. ]
Fixes: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")
Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/20260527-fix_call_depth_in_trampoline-v1-1-1c1abc8ae310@bootlin.com
|
|
During trace remote loading, hyp_trace_load() allocates the descriptor
pages but fails to store the allocated size in trace_buffer->desc_size.
As a result, when unloading the trace buffer, hyp_trace_unload() calls
free_pages_exact() with a size of 0 which fails to free the memory.
Fix this by updating the descriptor size in trace_buffer->desc_size.
Fixes: 3aed038aac8d ("KVM: arm64: Add trace remote for the nVHE/pKVM hyp")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521124613.911067-4-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When sharing the trace buffer with the hypervisor, if sharing a page
fails, the rollback path in hyp_trace_buffer_share_hyp() misses
unsharing the metadata page (meta_va) which was successfully shared
before entering the page sharing loop.
Additionally, if a failure occurs, the cleanup calls
hyp_trace_buffer_unshare_hyp() with an incorrect CPU index. Since that
CPU's pages were already rolled back locally in the loop, this leads to
duplicate unsharing attempts.
Fix both issues affecting the rollback.
Fixes: 3aed038aac8d ("KVM: arm64: Add trace remote for the nVHE/pKVM hyp")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521124613.911067-3-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
As the hyp_trace_buffer_unshare_hyp() function name suggests we should
unshare all the previously shared pages, otherwise we leak hyp-shared
pages which won't be reusable for hyp memory.
Fix the typo by calling __unshare_page() on the meta-page, ensuring all
previously shared pages are correctly unshared.
Fixes: 3aed038aac8d ("KVM: arm64: Add trace remote for the nVHE/pKVM hyp")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521124613.911067-2-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
AArch32 writes to PMU event counters cannot update the top 32 bits,
even when PMUv3p5 makes the counters 64-bit. KVM therefore needs to
preserve the existing high half and only update the low half written by
the guest, unless the caller explicitly forces a full reset through
PMCR.P.
The current code masks @val down to the old high half before taking
lower_32_bits(val), which means the low half is always zero. As a
result, AArch32 writes to event counters discard the guest-provided low
32 bits instead of storing them.
Build the new value from the old high 32 bits and the low 32 bits of
the value supplied by the guest.
Fixes: 26d2d0594d70 ("KVM: arm64: PMU: Do not let AArch32 change the counters' top 32 bits")
Signed-off-by: Qiang Ma <maqianga@uniontech.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260526074640.791991-1-maqianga@uniontech.com
Cc: stable@vger.kernel.org
|
|
Patch in Fixes: causes the usual:
unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id)
Call Trace:
early_init_intel
early_cpu_init
setup_arch
_printk
start_kernel
x86_64_start_reservations
x86_64_start_kernel
common_startup_64
because the kernel is booted in a guest.
In order to avoid it, this MSR access needs to be prevented when running
virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but
for this particular case it is too early yet.
The platform ID needs to be read as early as when microcode is loaded on
the BSP:
load_ucode_bsp ... -> get_microcode_blob ... -> intel_find_matching_signature
and by that time, CPUID leafs haven't been parsed yet.
The microcode loader already has logic to check early whether the kernel
is running virtualized so make that globally available to arch/x86/. The
query whether running virtualized is getting more and more prominent in
recent times so might as well make it an arch-global var which the rest
of the code can use.
Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com
|
|
socfpga_smp_prepare_cpus() looks up the Cortex-A9 SCU node with
of_find_compatible_node(), which returns a node reference that must be
released with of_node_put().
The function maps the SCU registers and then returns without dropping
that reference, leaking the node on both the success path and the
of_iomap() failure path.
Drop the reference once the mapping attempt is complete. The returned
MMIO mapping does not depend on keeping the device node reference held.
Fixes: 122694a0c712 ("ARM: socfpga: use of_iomap to map the SCU")
Cc: stable@vger.kernel.org
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
|
|
The Makefile version of rustc-option currently checks whether the option
exists for the host target instead of the target actually being compiled
for. It was done this way in commit 46e24a545cdb ("rust: kasan/kbuild:
fix missing flags on first build") to avoid a circular dependency on
target.json. However, because of this, rustc-option currently does not
function when cross-compiling from x86_64 to aarch64 if
CONFIG_SHADOW_CALL_STACK is enabled. This is because KBUILD_RUSTFLAGS
contains -Zfixed-x18 under this configuration. Since that flag does not
exist on the host target, rustc-option runs into a compilation failure
every time, leading to all flags being rejected as unsupported.
To fix this, update rustc-option to pass a --target parameter so that
the host target is not used. For targets using target.json, use a
built-in target that is as close as possible to the target created with
target.json to avoid the circular dependency on target.json.
One scenario where this causes a boot failure:
* Cross-compiled from x86_64 to aarch64.
* With CONFIG_SHADOW_CALL_STACK=y
* With CONFIG_KASAN_SW_TAGS=y
* With CONFIG_KASAN_INLINE=n
Then the resulting kernel image will fail to boot when it first calls
into Rust code with a crash along the lines of "Unable to handle kernel
paging request at virtual address 0ffffffc08541796". This is because the
call threshold is not specified, so rustc will inline kasan operations,
but the kasan shadow offset is not specified, which leads to the inlined
kasan instructions being incorrect.
Note that the -Zsanitizer=kernel-hwaddress parameter itself does not
lead to a rustc-option failure despite being aarch64-specific because
RUSTFLAGS_KASAN has not yet been added to KBUILD_RUSTFLAGS when
rustc-option is evaluated by the kasan Makefile.
Cc: stable@vger.kernel.org
Fixes: 46e24a545cdb ("rust: kasan/kbuild: fix missing flags on first build")
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20260507-rustc-option-cross-v2-1-2f650a49c2b5@google.com
[ Edited slightly:
- Reset variable to avoid using the environment.
- Use a simply expanded variable flavor for simplicity.
- Export variable so that behavior in sub-`make`s is consistent.
This matches other variables. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
|
|
Enable IOMMUFD and VFIO cdev such that PCI pass-through to QEMU/KVM can
optionally utilize native IOMMUFD. Note that because the defconfigs do
not enable IOMMUFD_VFIO_CONTAINER the default PCI pass-through using
VFIO with the existing container interface is not affected.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Fix ITS EventID sanitisation when restoring an interrupt
translation table.
- Fix PPI memory leak when failing to initialise a vcpu.
- Correctly return an error when the validation of a hypervisor trace
descriptor fails, and limit this validation to protected mode only.
RISC-V:
- Fix invalid HVA warning in steal-time recording
- Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info() and
pmu_snapshot_set_shmem()
- Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
- Fix sign extension of value for MMIO loads
s390:
- Fix bugs in vSIE (nested virtualization) and UCONTROL, caused by
the page table rewrite.
x86:
- Apply erratum #1235 workaround (disable AVIC IPI virtualization) on
Hygon Family 18h, just like on AMD Family 17h.
- When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM,
return the VM's configured APIC bus frequency instead of the
default. This is less confusing (read: not wrong) and makes it
easier to fill in CPUID information that communicates the APIC bus
frequency to the guest.
Selftests:
- Do not include glibc-internal <bits/endian.h>; it worked by chance
and broke building KVM selftests with musl"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235)
KVM: selftests: Verify that KVM returns the configured APIC cycle length
KVM: x86: Return the VM's configured APIC bus frequency when queried
KVM: selftests: elf: Include <endian.h> instead of <bits/endian.h>
KVM: s390: Properly reset zero bit in PGSTE
KVM: s390: vsie: Fix redundant rmap entries
KVM: s390: vsie: Fix unshadowing logic
KVM: s390: Fix leaking kvm_s390_mmu_cache in case of errors
KVM: s390: vsie: Fix memory leak when unshadowing
KVM: arm64: Fix nVHE/pKVM hyp tracing error on invalid desc
KVM: arm64: vgic: Free private_irqs when init fails after allocation
KVM: arm64: vgic-its: Reject restored DTE with out-of-range num_eventid_bits
RISC-V: KVM: Fix sign extension for MMIO loads
RISC-V: KVM: Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
riscv: kvm: return SBI_ERR_FAILURE for pmu_event_info() when OOM
riscv: kvm: return SBI_ERR_FAILURE for pmu_snapshot_set_shmem() when OOM
RISC-V: KVM: Fix invalid HVA warning in steal-time recording
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- On SEV guests, handle set_memory_{encrypted,decrypted}() failures
more conservatively by assuming that all affected pages are
unencrypted (Carlos López)
- Disable broadcast TLB flush when PCID is disabled (Tom Lendacky)
- Fix VMX vs. hrtimer_rearm_deferred() regression (Peter Zijlstra)
- Move IRQ/NMI dispatch code from KVM into x86 core, to prepare for a
KVM x2apic fix (Peter Zijlstra)
- Fix incorrect munmap() size on map_vdso() failure (Guilherme Giacomo
Simoes)
* tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
virt: sev-guest: Explicitly leak pages in unknown state
x86/mm: Disable broadcast TLB flush when PCID is disabled
x86/kvm/vmx: Fix VMX vs hrtimer_rearm_deferred()
x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core
x86/vdso: Fix incorrect size in munmap() on map_vdso() failure
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux
Pull nios2 fixes from Dinh Nguyen:
- Implement _THIS_IP_ for inline asm
- Add Simon Schuster as a maintainer and mark the NIOS2 as Supported
* tag 'nios2_updates_for_v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
nios2: Implement _THIS_IP_ using inline asm
MAINTAINERS: arch/nios2: Add Simon Schuster as co-maintainer
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Rework KASLR to avoid initrd overlap, remove some unused code to avoid
a build warning, fix some bugs in kprobes and KVM"
* tag 'loongarch-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Move some variable declarations to paravirt.h
LoongArch: kprobes: Fix handling of fatal unrecoverable recursions
LoongArch: kprobes: Use larch_insn_text_copy() to patch instructions
LoongArch: Remove unused code to avoid build warning
LoongArch: Avoid initrd overlap during kernel relocation
LoongArch: Skip relocation-time KASLR if already applied
efi/loongarch: Randomize kernel preferred address for KASLR
|
|
Hygon Family 18h CPUs are derived from AMD Family 17h (Zen1) silicon and
share the same erratum #1235: hardware may read a stale IsRunning=1 bit
during ICR write emulation and silently fail to generate an
AVIC_IPI_FAILURE_TARGET_NOT_RUNNING VM-Exit on the sending vCPU.
The absence of the VM-Exit causes KVM to miss the required wakeup of
blocking target vCPUs, leading to hung vCPUs and unbounded delays in
guest execution.
Extend the existing AMD Family 17h erratum #1235 workaround to also cover
Hygon Family 18h. With IPI virtualization disabled, KVM never sets
IsRunning=1 in the Physical ID table, so every non-self IPI generates a
VM-Exit and is correctly emulated.
Fixes: 8de4a1c8164e ("KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235")
Cc: <stable@vger.kernel.org>
Signed-off-by: Tina Zhang <zhang_wei@open-hieco.net>
Message-ID: <20260522040014.3380201-1-zhang_wei@open-hieco.net>
|
|
When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the
VM's configured APIC bus frequency, not KVM's default. Aside from the fact
that returning the default frequency is blatantly wrong if userspace has
changed the frequency, returning the configured frequency means userspace
can blindly trust the result, e.g. when filling PV CPUID information that
communicates the APIC bus frequency to the guest.
Fixes: 6fef518594bc ("KVM: x86: Add a capability to configure bus frequency for APIC timer")
Reported-by: David Woodhouse <dwmw2@infradead.org>
Closes: https://lore.kernel.org/all/ab84153e33fbe7c25667f595c56b310d4d5a93ef.camel@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260522173526.3539407-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
HEAD
KVM/riscv fixes for 7.1, take #1
- Fix invalid HVA warning in steal-time recording
- Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info()
and pmu_snapshot_set_shmem()
- Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler
- Fix sign extension of value for MMIO loads
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: some vSIE and UCONTROL fixes
Fix some memory issues and some hangs in vSIE.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.1, take #3
- Fix ITS EventID sanitisation when restoring an interrupt translation
table.
- Fix PPI memory leak when failing to initialise a vcpu.
- Correctly return an error when the validation of a hypervisor trace
descriptor fails, and limit this validation to protected mode only.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Handle probe on hinted conditional branch instructions.
BC.cond instructions can be simulated in the same way as B.cond
instructions, so extend the decode mask for B.cond to cover BC.cond
- Flush the walk cache when unsharing PMD tables. Recent changes to
huge_pmd_unshare() introduced mmu_gather::unshared_tables but the
arm64 code was still treating the TLB flushing as only targeting leaf
entries (TLBI VALE1IS).
Fix it by using non-leaf-only instructions (TLBI VAE1IS) when
tlb->unshared_tables is set
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: tlb: Flush walk cache when unsharing PMD tables
arm64: probes: Handle probes on hinted conditional branch instructions
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Fix PAI NNPA mismatch between counting and recording, where sampling
reports twice the value
- Fix loss of PAI counter increments during recording on systems with
many CPUs under heavy load, while counting is not affected
- On some supported machines, CHSC cannot access memory outside the DMA
zone, causing CHSC command failures. Restore GFP_DMA flag when
allocating memory for CHSC control blocks
- Align the numbering scheme for higher-level topology structures like
socket, book, drawer with other hardware identifiers e.g. in sysfs,
procfs and tools like lscpu
* tag 's390-7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/topology: Use zero-based numbering for containing entities
s390/cio: Restore GFP_DMA for CHSC allocation
s390/pai: Fix missing PAI counter increments under heavy load
s390/pai: Disable duplicate read of kernel PAI counter value
|
|
When huge_pmd_unshare() is called to unshare a PMD table, the
tlb_unshare_pmd_ptdesc() function sets tlb->unshared_tables=true
but the aarch64 tlb_flush() only checked tlb->freed_tables to
determine whether to use TLBF_NONE (vae1is, invalidates walk
cache) or TLBF_NOWALKCACHE (vale1is, leaf-only).
This caused the stale PMD page table entry to remain in the walk cache
after unshare, potentially leading to incorrect page table walks.
Fix by including unshared_tables in the check, so that when
unsharing tables, TLBF_NONE is used and the walk cache is properly
invalidated.
Here is the detailed distinction between vae1is and vale1is:
| Instruction Combination | Actual Invalidation Scope |
| ------------------------ | --------------------------------------------------|
| `VAE1IS` + TTL=`0` | All entries at all levels (full invalidation) |
| `VAE1IS` + TTL=`2` (L2) | Non-leaf at Level 0/1 + leaf at Level 2 |
| `VALE1IS` + TTL=`0` | Leaf entries at all levels (non-leaf not cleared) |
| `VALE1IS` + TTL=`2` (L2) | Leaf entry at Level 2 only |
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Fixes: 8ce720d5bd91 ("mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather")
Cc: <stable@vger.kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Prevent a crash from happening as the first serial port is initialised:
Console: switching to mono frame buffer device 160x64
fb0: PMAG-AA frame buffer device at tc0
DECstation Z85C30 serial driver version 0.10
CPU 0 Unable to handle kernel paging request at virtual address 0000002c, epc == 803ab00c, ra == 803aafe0
Oops[#1]:
CPU: 0 PID: 1 Comm: swapper Not tainted 6.4.0-rc3-00031-g84a9582fd203-dirty #57
$ 0 : 00000000 10012c00 803aaeb0 00000000
$ 4 : 80e12f60 80e12f50 80e12f58 81000030
$ 8 : 00000000 805ff37c 00000000 33433538
$12 : 65732030 00000006 80c2915d 6c616972
$16 : 80e12f00 807b7630 00000000 00000000
$20 : 00000004 00000348 000001a0 807623b8
$24 : 00000018 00000000
$28 : 80c24000 80c25d60 8078b148 803aafe0
Hi : 00000000
Lo : 00000000
epc : 803ab00c serial_base_ctrl_add+0x78/0xf4
ra : 803aafe0 serial_base_ctrl_add+0x4c/0xf4
Status: 10012c03 KERNEL EXL IE
Cause : 00000008 (ExcCode 02)
BadVA : 0000002c
PrId : 00000440 (R4400SC)
Modules linked in:
Process swapper (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000)
Stack : 80760000 00000cc0 00400044 00400040 803aa02c 80d61ab8 00000000 807b7630
80760000 807623b8 807b7628 803aa644 80386998 00000000 80e17780 80220f68
80e17780 80d61ab8 80c17d80 80e17780 80e17780 8063c798 80e17780 80383fa0
00000010 80e17780 00000000 80386998 807a0000 00000000 00400040 8038f848
807623b8 80d61ab8 00000004 80e17780 00000000 803a68e4 80c25e2c 803bb884
...
Call Trace:
[<803ab00c>] serial_base_ctrl_add+0x78/0xf4
[<803aa644>] serial_core_register_port+0x174/0x69c
[<8077e9ac>] zs_init+0xc8/0xfc
[<800404d4>] do_one_initcall+0x40/0x2ac
[<8076cecc>] kernel_init_freeable+0x1e4/0x270
[<80605bec>] kernel_init+0x20/0x108
[<800431e8>] ret_from_kernel_thread+0x14/0x1c
Code: 2442aeb0 ae120024 ae0200d0 <8c67002c> 50e00001 8c670000 3c06806e 3c05806e afb30010
---[ end trace 0000000000000000 ]---
(report at the offending commit) -- where a pointer is dereferenced that
has been derived from a null pointer to the port's parent device.
Since no device is available with legacy probing and it's not anymore a
preferable way to discover devices anyway, switch the driver to using a
platform device and use it as the port's parent device. Update resource
handling accordingly and only request the actual span of addresses used
within the slot, which will have had its resource already requested by
generic platform device code.
Use platform_driver_probe() not just because SCC devices are fixed with
solder on board and not straightforward to remove, but foremost because
the associated TTY's major device number is the same as used by the dz
driver and the first driver to claim it will prevent the other one from
using it. Either one DZ device or some SCC devices will be present in a
given system but never both at a time, and therefore we want the major
device number to be claimed by the first driver to actually successfully
bind to its device and platform_driver_probe() is a way to fulfil that.
An unfortunate consequence of the switch to a platform device is we now
hand the console over from the bootconsole much later in the bootstrap.
The firmware console handler appears good enough though to work so late
and in particular with interrupts enabled.
Since there is one way only remaining to reach zs_reset() now, remove
the port initialisation marker as no longer needed and go through the
channel reset unconditionally.
Fixes: 84a9582fd203 ("serial: core: Start managing serial controllers to enable runtime PM")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # needs to use .remove_new for <= 6.10
Link: https://patch.msgid.link/alpine.DEB.2.21.2605062328480.46195@angie.orcam.me.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Prevent a crash from happening as the first serial port is initialised:
Console: switching to colour frame buffer device 160x64
tgafb: SFB+ detected, rev=0x02
fb0: Digital ZLX-E1 frame buffer device at 0x1e000000
DECstation DZ serial driver version 1.04
CPU 0 Unable to handle kernel paging request at virtual address 000000bc, epc == 8048b3a4, ra == 80470a78
Oops[#1]:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.19.0-dirty #35 NONE
$ 0 : 00000000 1000ac00 00000004 804707ac
$ 4 : 00000000 80e20850 80e20858 81000030
$ 8 : 00000000 8072c81c 00000008 fefefeff
$12 : 6c616972 00000006 80c5917f 69726420
$16 : 80e20800 00000000 808f8968 80e20800
$20 : 00000000 807f5a90 808b0094 808d3bc8
$24 : 00000018 80479030
$28 : 80c2e000 80c2fd70 00000069 80470a78
Hi : 00000004
Lo : 00000000
epc : 8048b3a4 __dev_fwnode+0x0/0xc
ra : 80470a78 serial_base_ctrl_add+0xa0/0x168
Status: 1000ac04 IEp
Cause : 30000008 (ExcCode 02)
BadVA : 000000bc
PrId : 00000220 (R3000)
Modules linked in:
Process swapper/0 (pid: 1, threadinfo=(ptrval), task=(ptrval), tls=00000000)
Stack : 00400044 00400040 8046f4cc 00000000 808a6148 808a0000 808f8968 8086983c
808e0000 8046fc84 1000ac01 00000028 80e20700 802ba3f8 80e20700 80d34a94
80c1b900 80e20700 80e20700 80e20700 80e20700 80444650 00000000 00000000
00000000 807f5a90 808b0094 80447080 00400040 808e0000 80d34a94 808a6148
80d34a94 00000004 80e20700 00000000 8076974c 80469810 80c2fe3c 1000ac01
...
Call Trace:
[<8048b3a4>] __dev_fwnode+0x0/0xc
[<80470a78>] serial_base_ctrl_add+0xa0/0x168
[<8046fc84>] serial_core_register_port+0x1c8/0x974
[<808c6af0>] dz_init+0x74/0xc8
[<800470e0>] do_one_initcall+0x44/0x2d4
[<808b111c>] kernel_init_freeable+0x258/0x308
[<8072e434>] kernel_init+0x20/0x114
[<80049cd0>] ret_from_kernel_thread+0x14/0x1c
Code: 27bd0018 03e00008 2402ffea <8c8200bc> 03e00008 00000000 27bdffc0 afbe0038 afb30024
---[ end trace 0000000000000000 ]---
-- where a pointer is dereferenced that has been derived from a null
pointer to the port's parent device.
Since no device is available with legacy probing and it's not anymore a
preferable way to discover devices anyway, switch the driver to using a
platform device and use it as the port's parent device. Update resource
handling accordingly and only request the actual span of addresses used
within the slot, which will have had its resource already requested by
generic platform device code.
Use platform_driver_probe() not just because the DZ device is fixed with
solder on board and not straightforward to remove, but foremost because
the associated TTY's major device number is the same as used by the zs
driver and the first driver to claim it will prevent the other one from
using it. Either one DZ device or some SCC devices will be present in a
given system but never both at a time, and therefore we want the major
device number to be claimed by the first driver to actually successfully
bind to its device and platform_driver_probe() is a way to fulfil that.
An unfortunate consequence of the switch to a platform device is we now
hand the console over from the bootconsole much later in the bootstrap.
The firmware console handler appears good enough though to work so late
and in particular with interrupts enabled.
Conversely only starting the console port so late lets the reset code
fully utilise our delay handlers, so switch from udelay() to fsleep()
for transmitter draining so as to avoid busy-waiting for an excessive
amount of time.
Fixes: 84a9582fd203 ("serial: core: Start managing serial controllers to enable runtime PM")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # needs to use .remove_new for <= 6.10
Link: https://patch.msgid.link/alpine.DEB.2.21.2605062326540.46195@angie.orcam.me.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In case of memory pressure, it's possible that a guest page gets freed
and then almost immediately reused by the guest. If CMMA is enabled,
_essa_clear_cbrl() will discard all pages that are either unused or
zero. If a discarded page is reused before _essa_clear_cbrl() is called,
and the pgste.zero bit is not cleared, the page will be discarded
despite not being unused.
When calling _gmap_ptep_xchg(), always clear the pgste.zero bit. This
prevents the page from being accidentally discarded when not unused.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
|
|
The address passed to the gmap rmap was not being masked. As a
consequence several different (but functionally equivalent) rmap
entries were being created for each shadowed table.
Fix this by properly masking the address depending on the table level.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
|
|
In some cases (i.e. under extreme memory pressure on the host),
attempting to shadow memory will result in the same memory being
unshadowed, causing a loop.
Add a PGSTE bit to distinguish between shadowed memory and shadowed DAT
tables, fix the unshadowing logic in _gmap_ptep_xchg() to prevent
unnecessary unshadowing and perform better checks.
Also fix the unshadowing logic in _gmap_crstep_xchg_atomic() which did
not unshadow properly when the large page would become unprotected.
Opportunistically add a check in gmap_protect_rmap() to make sure it
won't be called with level == TABLE_TYPE_PAGE_TABLE.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
|
|
Fix a memory leak that can happen if gmap_ucas_map_one() or
kvm_s390_mmu_cache_topup() return error values.
Also fix a similar issue in gmap_set_limit().
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Reported-by: Jiaxin Fan <jiaxin.fan@ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
|
|
When performing a partial unshadowing, the rmap was being leaked.
Add the missing kfree().
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
|
|
Some variables relative with paravirt feature are declared in the header
file asm/qspinlock.h, however this file can be included only when option
CONFIG_SMP is on. There is compiling warnings if CONFIG_SMP is off since
variables are not declared.
Move these variable declarations to header file asm/paravirt.h to avoid
compiling warnings.
Fixes: c43dce6f13fb ("LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202605061313.O8Hswm2b-lkp@intel.com/
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
KPROBE_HIT_SS and KPROBE_REENTER are two types of fatal recursions that
can not be safely recovered in kprobes.
KPROBE_HIT_SS means that a kprobe is hit during single-stepping. At
this point, the architecture-specific single-step context is already
active. Nested single-stepping would corrupt the state, as the kprobe
control block (kcb) and hardware registers cannot safely store multiple
levels of stepping state.
KPROBE_REENTER means that a third-level recursion occurs when a probe
is hit while the system is already handling a nested probe (second-
level). The kcb only provides a single slot (prev_kprobe) to backup the
state. When a third probe is hit, there is no more space to save the
state without corrupting the first-level backup.
Kprobes work by replacing instructions with breakpoints. In order to
execute the original instruction and continue, it must be moved to a
temporary "single-step" slot. Since there is no backup space left to
set up this slot safely, the CPU would be forced to return to the same
original breakpoint address, triggering an endless loop.
Currently, the code only prints a warning and returns. This leads to
an infinite re-entry loop as the CPU repeatedly hits the same trap and
a "stuck" CPU core because preemption was disabled at the start of the
handler and never re-enabled in this early return path.
Fix the logic by:
1. Merging KPROBE_HIT_SS and KPROBE_REENTER cases, as both represent
fatal recursions that cannot be safely recovered.
2. Replacing WARN_ON_ONCE() with BUG() to terminate the system. This
aligns LoongArch with other architectures (x86, arm64, riscv) and
prevents stack overflow while providing diagnostic information.
Fixes: 6d4cc40fb5f5 ("LoongArch: Add kprobes support")
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
On SMP systems, kprobe handlers would occasionally fail to execute on
certain CPU cores. The issue is hard to reproduce and typically occurs
randomly under high system load.
The root cause is a software-side instruction hazard. According to the
LoongArch Reference Manual, while the cache coherency is maintained by
hardware, software must explicitly use the "IBAR" instruction to ensure
the instruction fetch unit (IFU) observes the effects of recent stores.
The current arch_arm_kprobe() and arch_disarm_kprobe() only execute the
"IBAR" barrier (via flush_insn_slot -> local_flush_icache_range) on the
local CPU. This leaves a vulnerable window where remote CPU cores may
continue executing stale instructions from their pipelines or prefetch
buffers, as they have not executed an "IBAR" since the code modification.
Switch to larch_insn_text_copy() to fix this:
1. Synchronization: It uses stop_machine_cpuslocked() to synchronize all
online CPUs, ensuring no CPU is executing the target code area during
modification.
2. Visibility: By passing cpu_online_mask to stop_machine_cpuslocked(),
the callback text_copy_cb() is executed on all online cores. Each CPU
core invokes local_flush_icache_range() to execute "IBAR", clearing
instruction hazards system-wide and ensuring the "break" instruction
is visible to the fetch units of all cores.
3. Robustness: It properly manages memory write permissions (ROX/RW) for
the kernel text segment during patching, ensuring compatibility with
CONFIG_STRICT_KERNEL_RWX.
Cc: <stable@vger.kernel.org> # 6.18+
Fixes: 6d4cc40fb5f5 ("LoongArch: Add kprobes support")
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Both GCC [1] and Clang [2] consider the generic version of _THIS_IP_ to
be broken:
#define _THIS_IP_ ({ __label__ __here; __here: (unsigned long)&&__here; })
In particular, the address of a label is only expected to be used with a
computed goto.
While the generic version more or less works today, it is known to be
brittle and may break with current and future optimizations. For
example, Clang -O2 always returns 1 when this function is inlined:
static inline unsigned long get_ip(void)
{ return ({ __label__ __here; __here: (unsigned long)&&__here; }); }
Fix it by overriding _THIS_IP_ in <asm/linkage.h> (which is included by
<linux/instruction_pointer.h>) using an architecture-specific inline asm
version. Additionally, avoiding taking the address of a label prevents
compilers from emitting spurious indirect branch targets (e.g. ENDBR or
BTI) under control-flow integrity schemes.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120071 [1]
Link: https://github.com/llvm/llvm-project/issues/138272 [2]
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, wireless and netfilter.
Craziness continues with no end in sight. Even discounting the driver
revert this is a pretty huge PR for standards of the previous era. I'd
speculate - we haven't seen the worst of it, yet. Good news, I guess,
is that so far we haven't seen many (any?) cases of "AI reported a
bug, we fixed it and a real user regressed".
Current release - fix to a fix:
- Bluetooth: btmtk: accept too short WMT FUNC_CTRL events
- vsock/virtio: relax the recently added memory limit a little
Current release - regressions:
- IB/IPoIB: make sure IB drivers always use async set_rx_mode since
some (mlx5) are now required to use it due to locking changes
Previous releases - regressions:
- udp: fix UDP length on last GSO_PARTIAL segment
- af_unix: fix UAF read of tail->len in unix_stream_data_wait()
- tcp: fix stale per-CPU tcp_tw_isn leak enabling ISN prediction
- mlx5e: fix unlocked writing to ICOSQ, breaking AF_XDP
Previous releases - always broken:
- tap: fix stack info leak in tap_ioctl() SIOCGIFHWADDR
- ipv4: raw: reject IP_HDRINCL packets with ihl < 5
- Bluetooth: a lot of locking and concurrency fixes (as always)
- batman-adv (mesh wireless networking): a lot of random fixes for
issues reported by security researchers and Sashiko
- netfilter: same thing, a lot of small security-ish fixes all over
the place, nothing really stands out
Misc:
- bring back the old 3c509 driver, Maciej wants to maintain it"
* tag 'net-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (187 commits)
net: enetc: avoid VF->PF mailbox timeout during SR-IOV teardown
net: enetc: fix init and teardown order to prevent use of unsafe resources
net: enetc: fix unbounded loop and interrupt handling in VF-to-PF messaging
net: enetc: fix DMA write to freed memory in enetc_msg_free_mbx()
net: enetc: fix race condition in VF MAC address configuration
net: enetc: fix TOCTOU race and validate VF MAC address
net: enetc: add ratelimiting to VF mailbox error messages
net: enetc: fix missing error code when pf->vf_state allocation fails
net: enetc: fix incorrect mailbox message status returned to VFs
net: bridge: prevent too big nested attributes in br_fill_linkxstats()
l2tp: use list_del_rcu in l2tp_session_unhash
net: bcmgenet: keep RBUF EEE/PM disabled
ethernet: 3c509: Fix most coding style issues
ethernet: 3c509: Update documentation to match MAINTAINERS
ethernet: 3c509: Add GPL 2.0 SPDX license identifier
ethernet: 3c509: Fix AUI transceiver type selection
Revert "drivers: net: 3com: 3c509: Remove this driver"
tools: ynl: support listening on all nsids
net: gro: don't merge zcopy skbs
pds_core: ensure null-termination for firmware version strings
...
|
|
'20260416-qcom_ice_power_and_clk_vote-v5-13-5ccf5d7e2846@oss.qualcomm.com' into arm64-fixes-for-7.1
Merge the fixes to add power-domain and correct clocks for the ICC block
in Eliza and Milos through a topic branch, to allow them to be merged
also into arm64-for-7.2 to resolve the merge conflicts that would
otherwise appear.
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
Qualcomm in-line crypto engine (ICE) platform driver specifies and votes
for its own resources. Before accessing ICE hardware during probe, to
avoid potential unclocked register access issues (when clk_ignore_unused
is not passed on the kernel command line), in addition to the 'core' clock
the 'iface' clock should also be turned on by the driver. This can only be
done if the GCC_UFS_PHY_GDSC power domain is enabled. Specify both the
GCC_UFS_PHY_GDSC power domain and the 'iface' clock in the ICE node for
eliza.
Fixes: af20af39fc09b ("arm64: dts: qcom: Introduce Eliza Soc base dtsi")
Signed-off-by: Harshal Dev <harshal.dev@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the
Reviewed-by: Kuldeep Singh <kuldeep.singh@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260416-qcom_ice_power_and_clk_vote-v5-13-5ccf5d7e2846@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
Qualcomm in-line crypto engine (ICE) platform driver specifies and votes
for its own resources. Before accessing ICE hardware during probe, to
avoid potential unclocked register access issues (when clk_ignore_unused
is not passed on the kernel command line), in addition to the 'core' clock
the 'iface' clock should also be turned on by the driver. This can only be
done if the UFS_PHY_GDSC power domain is enabled. Specify both the
UFS_PHY_GDSC power domain and the 'iface' clock in the ICE node for milos.
Fixes: 04bb37433330e ("arm64: dts: qcom: milos: Add UFS nodes")
Signed-off-by: Harshal Dev <harshal.dev@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Kuldeep Singh <kuldeep.singh@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260416-qcom_ice_power_and_clk_vote-v5-12-5ccf5d7e2846@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
|
|
Flush the current TLB when xAVIC *or* x2AVIC is activated, as KVM is
(apparently) responsible for purging TLB entries when transitioning from
xAVIC to x2AVIC. The APM says a whole lot of nothing about TLB flushing
with respect to (x2)AVIC, but empirical data strongly suggests hardware
also does a whole lot of nothing.
Failure to flush the TLB when enabling x2AVIC can lead to guest accesses
to the APIC base address getting incorrectly redirected to the virtual
APIC page. The flaw most visibly manifests as failures in KVM-Unit-Test's
verify_disabled_apic_mmio() testcase when x2APIC is enabled (though for
reasons unknown, the test only reliably fails with EFI builds).
Fixes: 0ccf3e7cb95a ("KVM: SVM: Flush the "current" TLB when activating AVIC")
Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://patch.msgid.link/20260515171536.1841645-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use kvm_register_mark_dirty() instead of kvm_register_is_dirty() to
actually mark VCPU_EXREG_ERAPS as dirty when emulating
INVPCID_TYPE_SINGLE_CTXT. kvm_register_is_dirty() is a read-only
predicate whose return value is discarded, making the call a no-op.
Without this fix, a single-context INVPCID will not trigger a RAP clear
on the next VMRUN, breaking the ERAPS security guarantee.
Fixes: db5e82496492 ("KVM: SVM: Virtualize and advertise support for ERAPS")
Signed-off-by: Emily Ehlert <ehemily@amazon.de>
Link: https://patch.msgid.link/20260518135956.82569-1-ehemily@amazon.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer fixes from Steven Rostedt:
- Fix reporting MISSED EVENTS in trace iterator
When the "trace" file is read with tracing enabled, if the writer
were to pass the iterator reader, it resets, sets a "missed_events"
flag and continues. The tracing output checks for missed events and
if there are some, it prints out "[LOST EVENTS]" to let the user know
events were dropped.
But the clearing of the missed_events happened when the tracing
system queried the ring buffer iterator about missed events. This was
premature as the ring buffer is per CPU, and the tracing code reads
all the CPU buffers and checks for missed events when it is read. If
the CPU iterator that had missed events isn't printed next, the
output for the LOST EVENTS is lost.
Clear the missed_events flag when the iterator moves to the next
event and not when the missed_events flag is queried. Also clear it
on reset.
- Flush and stop the persistent ring buffer on panic
On panic the persistent ring buffer is used to debug what caused the
panic. But on some architectures, it requires flushing the memory
from cache, otherwise, the ring buffer persistent memory may not have
the last events and this could also cause the ring buffer to be
corrupted on the next boot.
- Fix nr_subbufs initialization in simple_ring_buffer_init_mm
The remote simple ring buffer meta data nr_subbufs is initialized too
early and gets cleared later on, making it zero and not reflect the
actual number of sub-buffers.
- Fix unload_page for simple_ring_buffer init rollback
On error, the pages loaded need to be unloaded. To unload a page it
is expected that: page = load_page(va); -> unload_page(page). But the
code was doing: unload_page(va) and not unload_page(page).
- Create output file from cmd_check_undefined
The check for undefined symbols checks if the file *.o.checked exists
and if so it skips doing the work. But the *.o.checked file never was
created making every build do the work even when it was already done
previously.
* tag 'trace-ringbuffer-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Create output file from cmd_check_undefined
tracing: Fix unload_page for simple_ring_buffer init rollback
tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm()
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Fix reporting of missed events in iterator
|