linux.git/virt, branch v7.2-rc1

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

2026-06-25T17:21:13+00:00

Pull kvm fixes from Paolo Bonzini:
 "s390:

   - Fix S390_USER_OPEREXEC so it can now be enabled regardless of other
     unrelated capabilities

   - Fix handling of the _PAGE_UNUSED pte bit that could lead to guest
     memory corruption in some scenarios

   - A bunch of misc gmap fixes (locking, behaviour under memory
     pressure)

   - Fix CMMA dirty tracking

  x86:

   - Tidy up some WARN_ON() and BUG_ON(), replacing them with
     WARN_ON_ONCE() or KVM_BUG_ON(). All of these have obviously never
     triggered, or somebody would have been annoyed earlier, but still...

   - Fix missing interrupt due to stale CR8 intercept

   - Add a statistic that can come in handy to debug leaks as well as
     the vulnerability to a class of recently-discovered issues

   - Do not ask arch/x86/kernel to export
     default_cpu_present_to_apicid() just for KVM"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
  x86/apic: KVM: Use cpu_physical_id() to get APIC ID of running vCPU for AVIC
  KVM: x86/mmu: Expose number of shadow MMU shadow pages as a stat
  KVM: x86: Unconditionally recompute CR8 intercept on PPR update
  KVM: VMX: Grab vmcs12 on CR8 interception update iff vCPU is in guest mode
  KVM: x86: WARN (once) if RTC pending EOI tracking goes off the rails
  KVM: x86: WARN and fail kvm_set_irq() if a PIC or I/O APIC vector is invalid
  KVM: x86: Bug the VM, not the kernel, if the ISR count {under,over}flows
  KVM: x86/mmu: Bug the VM, not the host kernel, if KVM write-protects upper SPTEs
  KVM: x86: Replace BUG_ON() with WARN_ON_ONCE() on "bad" nested GPA translation
  KVM: Replace guest-triggerable BUG_ON() in ioeventfd datamatch with get_unaligned()
  KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits()
  KVM: s390: selftests: Fix cmma selftest
  KVM: s390: Fix cmma dirty tracking
  KVM: s390: Fix locking in kvm_s390_set_mem_control()
  KVM: s390: Fix handle_{sske,pfmf} under memory pressure
  KVM: s390: Fix code typo in gmap_protect_asce_top_level()
  KVM: s390: Do not set special large pages dirty
  KVM: s390: Fix dat_peek_cmma() overflow
  s390/mm: Fix handling of _PAGE_UNUSED pte bit
  KVM: s390: Fix typo in UCONTROL documentation
  ...

KVM: Replace guest-triggerable BUG_ON() in ioeventfd datamatch with get_unaligned()

2026-06-24T09:25:14+00:00

Drop a BUG_ON() that has been reachable since it was first added, way back
in 2009, and instead use get_unaligned() to perform potentially-unaligned
accesses.

For a given store, KVM x86's emulator tracks the entire value in the
destination operand, x86_emulate_ctxt.dst.  If the destination is memory,
and the target splits multiple pages and/or is emulated MMIO, then KVM
handles each fragment independently.  E.g. on a page split starting at page
offset 0xffc, KVM writes 4 bytes to the first page, then the remaining
bytes to the second page, using ctxt->dst as the source for both (with
appropriate offsets).

If the destination splits a page *and* hits emulated MMIO on the second
page, then KVM will complete the write to the first page, then emulate the
MMIO access to the second page.  If there is a datamatch-enabled ioeventfd
at offset 0 of the second page, then KVM will process the remainder of the
store as a potential ioeventfd signal.

Putting it all together, if the guest emits a store that splits a page
starting at page offset N, and the second page has a datamatch-enabled
ioeventfd at offset 0, then KVM will check for datamatch using
&dst.valptr[N] as the source.  Due to dst (and thus dst.valptr) being
32-byte aligned, if N is not aligned to @len, the BUG_ON() fires.

E.g. with a 16-byte store at page offset 0xffc, to an ioeventfd of len 8,
all initial checks in ioeventfd_in_range() will succeed, and the BUG_ON()
fires due to @val being 4-byte aligned, but not 8-byte aligned.

  ------------[ cut here ]------------
  kernel BUG at arch/x86/kvm/../../../virt/kvm/eventfd.c:783!
  Oops: invalid opcode: 0000 [#1] SMP
  CPU: 0 UID: 1000 PID: 615 Comm: repro Not tainted 7.1.0-rc2-ff238429d1ea #365 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:ioeventfd_write+0x6c/0x70 [kvm]
  Call Trace:
   
   __kvm_io_bus_write+0x85/0xb0 [kvm]
   kvm_io_bus_write+0x53/0x80 [kvm]
   vcpu_mmio_write+0x66/0xf0 [kvm]
   emulator_read_write_onepage+0x12a/0x540 [kvm]
   emulator_read_write+0x109/0x2b0 [kvm]
   x86_emulate_insn+0x4f8/0xfb0 [kvm]
   x86_emulate_instruction+0x181/0x790 [kvm]
   kvm_mmu_page_fault+0x313/0x630 [kvm]
   vmx_handle_exit+0x18a/0x590 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xc81/0x1c90 [kvm]
   kvm_vcpu_ioctl+0x2d5/0x970 [kvm]
   __x64_sys_ioctl+0x8a/0xd0
   do_syscall_64+0xb7/0x890
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7f19c931a9bf
   
  Modules linked in: kvm_intel kvm irqbypass
  ---[ end trace 0000000000000000 ]---

In a perfect world, the fix would be to simply delete the BUG_ON(), as KVM
x86 doesn't perform alignment checks on "normal" memory accesses at CPL0.
Sadly, C99 ruins all the fun; while the x86 architecture plays nice,
dereferencing an unaligned pointer directly is undefined behavior in C,
e.g. triggers splats when running with CONFIG_UBSAN_ALIGNMENT=y.

Fixes: d34e6b175e61 ("KVM: add ioeventfd support")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson 
Message-ID: <20260612225241.678509-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

2026-06-19T15:56:49+00:00

Pull kvm updates from Paolo Bonzini:
 "arm64:

     This is a bit of an odd merge window on the KVM/arm64 front. There
     is absolutely no new feature in the pull request. It is purely
     fixes, because it is simply becoming too hard to review new stuff
     when so many AI-fuelled fixes hit the list.

   - Significant cleanup of the vgic-v5 PPI support which was merged in
     7.1. This makes the code more maintainable, and squashes a couple
     of bugs in the meantime

   - Set of fixes for the handling of the MMU in an NV context,
     particularly VNCR-triggered faults. S1POE support is fixed as well

   - Large set of pKVM fixes, mostly addressing recurring issues around
     hypervisor tracking of donated pages in obscure cases where the
     donation could fail and leave things in a bizarre state

   - Fixes for the so-called "lazy vgic init", which resulted in
     sleeping operations in non-preemptible sections. This turned out to
     be far more invasive than initially expected..

   - Reduce the overhead of L1/L2 context switch by not touching the FP
     registers

   - Fix the way non-implemented page sizes are dealt with when a guest
     insist on using them for S2 translation

   - The usual set of low-impact fixes and cleanups all over the map

  Loongarch:

   - On a request for lazy FPU load, load all FPU state that the VM
     supports instead of enabling only the part (FPU, LSX or LASX) that
     caused the FPU load request

   - Some enhancements about interrupt injection

   - Some bug fixes and other small changes

  RISC-V:

   - Batch G-stage TLB flushes for GPA range based page table updates

   - Convert HGEI line management to fully per-HART

   - Fix missing CSR dirty marking when FWFT state updated via ONE_REG

   - Fix stale FWFT feature exposure to Guest/VM

   - Speed up dirty logging write faults using MMU rwlock and atomic PTE
     updates using cmpxchg() for permission-only changes

   - Use flexible array for APLIC IRQ state

   - Use kvm_slot_dirty_track_enabled() for logging enable check on a
     memslot

   - Avoid skipping valid pages in kvm_riscv_gstage_wp_range()

   - Avoid skipping valid pages in kvm_riscv_gstage_unmap_range()

   - Use endian-specific __lelong for NACL shared memory

  S390:

   - KVM_PRE_FAULT_MEMORY support

   - Support for 2G hugepages

   - Support for the ASTFLEIE 2 facility

   - Support for fast inject using kvm_arch_set_irq_inatomic

   - Fix potential leak of uninitialized bytes

   - A few more misc gmap fixes

  x86:

   - Generic support for the more granular permissions allowed by EPT,
     namely "read" (which was previously usurping the U bit) and
     separate execution bits for kernel and userspace

   - Do not assume that all page tables start with U=1/W=1/NX=0 at the
     root, as AMD GMET needs to have U=0 at the root

   - Introduce common assembly macros for use within Intel and AMD
     vendor-specific vmentry code. This touches the SPEC_CTRL handling,
     which is now entirely done in assembly for Intel (by reusing the
     AMD code that already existed), and register save/restore which
     uses some macro magic to compute the offsets in the struct. Both of
     these are preparatory changes for upcoming APX support

   - Clean up KVM's register tracking and storage, primarily to prepare
     for APX support, which expands the maximum number of GPRs from 16
     to 32

   - Keep a single copy of the PDPTRs rather than two, since
     architecturally there is just one

   - Handle EXIT_FASTPATH_EXIT_USERSPACE in vendor code to ensure vendor
     code gets a chance to handle things like reaping the PML buffer

   - Update KVM's view of PV async enabling if and only if the MSR write
     fully succeeds

   - Fix a variety of issues where the emulator doesn't honor
     guest-debug state, and clean up related code along the way

   - Synthesize EPT Violation and #NPF "error code" bits when injecting
     faults into L1 that didn't originate in hardware (in which case the
     VMCS/VMCB doesn't hold relevant information)

   - Add support for virtualizing (well, emulating) AMD's flavor of
     CPL>0 CPUID faulting

   - Clean up the GPR APIs so that KVM's use of "raw" is consistent, and
     fix a variety of minor bugs along the way

   - Fix an OOB memory access due to not checking the VP ID when
     handling a Hyper-V PV TLB flush for L2

   - Fix a bug in the mediated PMU's handling of fixed counters that
     allowed the guest to bypass the PMU event filter

   - Allow userspace to return EAGAIN when handling SNP and TDX
     hypercalls, so the KVM can forward a "retry" status code to the
     guest, and reserve all unused error codes for future usage

   - Overhaul the TDP MMU => S-EPT code to move as much S-EPT specific
     logic as possible into the TDX code, and to funnel (almost) all
     S-EPT updates into a single chokepoint. The motivation is largely
     to prepare for upcoming Dynamic PAMT support, but the cleanups are
     nice to have on their own

   - Plug a hole in shadow page table handling, where KVM fails to
     recursively zap nested EPT/NPT shadow page tables when the nested
     hypervisor tears down its own EPT/NPT page tables from the bottom
     up

  x86 (Intel):

   - Support for nested MBEC (Mode-Based Execute Control), see above in
     the generic section; also run with MBEC enabled even for non-nested
     mode

   - Use the kernel's "enum pg_level" in the TDX APIs instead of the
     TDX-Module's level definitions (which are 0-based)

   - Rework the TDX memory APIs to not require/assume that guest memory
     is backed by "struct page" (in prepartion for guest_memfd hugepage
     support)

   - Fix a largely benign bug where KVM TDX would incorrectly state it
     could emulate several x2APIC MSRs

   - Use the "safe" WRMSR API when proxying LBR MSR writes as the
     to-be-written value is guest controlled and completely unvalidated

  x86 (AMD):

   - Support for nested GMET (Guest Mode Execution Trap), see above in
     the generic section; also run with GMET enabled even for non-nested
     mode

   - Fixes and minor cleanups to GHCB handling, on top of the earlier
     work already merged into 7.1-rc

   - Ensure KVM's copy of CR0 and CR3 are up-to-date prior to invoking
     fastpath handlers

   - Add support for virtualizing gPAT (KVM previously just used L1's
     PAT when running L2)

   - Fix goofs where KVM mishandles side effects (e.g. single-step and
     PMC updates) when emulating VMRUN

   - Fix a variety of bugs in AVIC's handling of x2APIC MSR
     interception, most notably where KVM didn't disable interception of
     IRR, ISR, and TMR regs

   - Add support for virtualizing Host-Only/Guest-Only bits in the
     mediated PMU

   - Don't advertise support for unusable VM types, and account for VM
     types that are disabled by firmware, e.g. to mitigate security
     vulnerabilities

   - Rewrite the SEV {en,de}crypt debug ioctls as they were riddle with
     bugs and unnecessarily complicated, and add comprehensive tests

   - Clean up and deduplicate the SEV page pinning code

   - Fix minor goofs related to writing back CPUID information after
     firmware rejects a CPUID page for an SNP vCPU

  Generic:

   - Rename invalidate_begin() to invalidate_start() throughout KVM to
     follow the kernel's nomenclature, e.g. for mmu_notifiers

   - Use guard() to cleanup up various KVM+VFIO flows

   - Minor cleanups

  guest_memfd:

   - Return -EEXIST instead of -EINVAL if userspace attempts to bind a
     gmem range to multiple memslots, and fix the test that was supposed
     to ensure KVM returns -EEXIST

   - Treat memslot binding offsets and sizes as unsigned values to fix a
     bug where KVM interprets a large "offset + size" as a negative
     value and allows a nonsensical offset

   - Use the inode number instead of the page offset for the NUMA
     interleaving index to fix a bug where the effective index would
     jump by two for consecutive pages (the caller also adds in the page
     offset)

  Selftests:

   - Randomize the dirty log test's delay when reaping the bitmap on the
     first pass, as always waiting only 1ms hid a KVM RISC-V bug as the
     test reaped the bitmap before KVM could build up enough state to
     hit the bug

   - A pile of one-off fixes and cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (326 commits)
  KVM: x86/mmu: Ensure hugepage is in by slot before checking max mapping level
  KVM: x86: Fix shadow paging use-after-free due to unexpected role
  KVM: s390: Introducing kvm_arch_set_irq_inatomic fast inject
  KVM: s390: Enable adapter_indicators_set to use mapped pages
  KVM: s390: Add map/unmap ioctl and clean mappings post-guest
  riscv: kvm: Use endian-specific __lelong for NACL shared memory
  KVM: selftests: access_tracking_perf_test: bump number of NUMA nodes to 32
  KVM: s390: vsie: Implement ASTFLEIE facility 2
  KVM: s390: vsie: Refactor handle_stfle
  s390/sclp: Detect ASTFLEIE 2 facility
  KVM: s390: Minor refactor of base/ext facility lists
  KVM: x86/mmu: move pdptrs out of the MMU
  KVM: x86: check that kvm_handle_invpcid is only invoked with shadow paging
  KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions
  KVM: nVMX: remove unnecessary code in prepare_vmcs02_rare
  KVM: x86: remove nested_mmu from mmu_is_nested()
  KVM: arm64: vgic-its: Make ABI commit helpers return void
  KVM: s390: Initialize KVM_S390_GET_CMMA_BITS memory
  LoongArch: KVM: Add missing slots_lock for device register/unregister
  LoongArch: KVM: Validate irqchip index in irqfd routing
  ...

Merge tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T22:29:45+00:00

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Reduce pipe->mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe->mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size > PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 << n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to ->poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...

Merge tag 'kvm-x86-vfio-7.2' of https://github.com/kvm-x86/linux into HEAD

2026-06-12T08:14:42+00:00

KVM VFIO changes for 7.2

Use guard() to cleanup up various KVM+VFIO flows.

Merge tag 'kvm-x86-sev-7.2' of https://github.com/kvm-x86/linux into HEAD

2026-06-12T08:13:03+00:00

KVM SEV changes for 7.2

 - Don't advertise support for unusuable VM types, and account for VM types
   that are disabled by firmware, e.g. to mitigate security vulnerabilities.

 - Rewrite the SEV {en,de}crypt debug ioctls as they were riddle with bugs and
   unnecessarily complicated, and add comprehensive tests.

 - Clean up and deduplicate the SEV page pinning code.

 - Fix minor goofs related to writing back CPUID information after firmware
   rejects a CPUID page for an SNP vCPU.

Merge tag 'kvm-x86-gmem-7.2' of https://github.com/kvm-x86/linux into HEAD

2026-06-12T08:08:52+00:00

KVM guest_memfd changes for 7.2

 - Return -EEXIST instead of -EINVAL if userspace attempts to bind a gmem
   range to multiple memslots, and fix the test that was supposed to ensure
   KVM returns -EEXIST.

 - Treat memslot binding offsets and sizes as unsigned values to fix a bug
   where KVM interprets a large "offset + size" as a negative value and allows
   a nonsensical offset.

 - Use the inode number instead of the page offset for the NUMA interleaving
   index to fix a bug where the effective index would jump by two for
   consecutive pages (the caller also adds in the page offset).

Merge tag 'kvm-x86-generic-7.2' of https://github.com/kvm-x86/linux into HEAD

2026-06-12T07:56:41+00:00

KVM generic changes for 7.2

 - Rename invalidate_begin() to invalidate_start() throughout KVM to follow
   the kernel's nomenclature, e.g. for mmu_notifiers.

 - Minor cleanups.

KVM: guest_memfd: fix NUMA interleave index double-counting

2026-06-08T16:15:43+00:00

kvm_gmem_get_policy() sets the interleave index (the output param that's
typically named "ilx") to the full page offset (vm_pgoff + vma offset).
But get_vma_policy() adds the page offset on top of the interleave index,
and so the offset is counted twice.  This causes NUMA interleaving to skip
nodes: for order-0 pages the effective index jumps by 2 for each
consecutive page.

The vm_op.get_policy() implementation should return only a per-file bias in
the interleave index (like shmem_get_policy does with inode->i_ino),
letting get_vma_policy() add the page-offset component.

Fix by setting the output interleave index to the inode number (a la shmem)
instead of the full page offset, as the index is intended to be a constant,
semi-random value for a given file, e.g. so that interleaving doesn't start
at the same node for every file, and so that allocations are round-robined
across nodes based on the page offset (the selected node would bounce/skip
around if the index isn't constant).

Found by Sashiko (sashiko.dev) AI code review.

Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
Cc: Sean Christopherson 
Cc: Paolo Bonzini 
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael S. Tsirkin 
Acked-by: David Hildenbrand (Arm) 
Reviewed-by: Shivank Garg 
Tested-by: Shivank Garg 
Fixes: 7f3779a3ac3e ("mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()")
Link: https://patch.msgid.link/0eff0a90667b900bee837d06b5db5025e1f304b5.1780501924.git.mst@redhat.com
[sean: use reverse fir-tree, massage changelog]
Signed-off-by: Sean Christopherson

libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers

2026-06-04T08:10:49+00:00

init_pseudo() now sets SB_I_NOEXEC and SB_I_NODEV by default, so the
per-caller assignments are redundant. Drop them.

Signed-off-by: John Hubbard 
Link: https://patch.msgid.link/20260604025315.245910-3-jhubbard@nvidia.com
Signed-off-by: Christian Brauner (Amutable)