<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/include/linux, branch v7.0.2</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>mm/userfaultfd: fix hugetlb fault mutex hash calculation</title>
<updated>2026-04-22T11:32:23+00:00</updated>
<author>
<name>Jianhui Zhou</name>
<email>jianhuizzzzz@gmail.com</email>
</author>
<published>2026-03-10T11:05:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f4689fc089765d36c026063fb22d23533e883eb6'/>
<id>f4689fc089765d36c026063fb22d23533e883eb6</id>
<content type='text'>
commit 0217c7fb4de4a40cee667eb21901f3204effe5ac upstream.

In mfill_atomic_hugetlb(), linear_page_index() is used to calculate the
page index for hugetlb_fault_mutex_hash().  However, linear_page_index()
returns the index in PAGE_SIZE units, while hugetlb_fault_mutex_hash()
expects the index in huge page units.  This mismatch means that different
addresses within the same huge page can produce different hash values,
leading to the use of different mutexes for the same huge page.  This can
cause races between faulting threads, which can corrupt the reservation
map and trigger the BUG_ON in resv_map_release().

Fix this by introducing hugetlb_linear_page_index(), which returns the
page index in huge page granularity, and using it in place of
linear_page_index().
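
A sketch of what such a helper can look like (essentially the old
vma_hugecache_offset() calculation; not necessarily the exact upstream
implementation):

  static inline pgoff_t hugetlb_linear_page_index(struct vm_area_struct *vma,
                                                  unsigned long address)
  {
          struct hstate *h = hstate_vma(vma);

          /* Offset within the mapping, counted in huge pages. */
          return ((address - vma-&gt;vm_start) &gt;&gt; huge_page_shift(h)) +
                 (vma-&gt;vm_pgoff &gt;&gt; huge_page_order(h));
  }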

Link: https://lkml.kernel.org/r/20260310110526.335749-1-jianhuizzzzz@gmail.com
Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c")
Signed-off-by: Jianhui Zhou &lt;jianhuizzzzz@gmail.com&gt;
Reported-by: syzbot+f525fd79634858f478e7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f525fd79634858f478e7
Acked-by: SeongJae Park &lt;sj@kernel.org&gt;
Reviewed-by: David Hildenbrand (Arm) &lt;david@kernel.org&gt;
Acked-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Cc: Jane Chu &lt;jane.chu@oracle.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: JonasZhou &lt;JonasZhou@zhaoxin.com&gt;
Cc: Muchun Song &lt;muchun.song@linux.dev&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Sidhartha Kumar &lt;sidhartha.kumar@oracle.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>KVM: x86: Use scratch field in MMIO fragment to hold small write values</title>
<updated>2026-04-22T11:32:21+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2026-02-25T01:20:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=3a7b6d75c8f85b09dea893f64a85a356bcf6c3fe'/>
<id>3a7b6d75c8f85b09dea893f64a85a356bcf6c3fe</id>
<content type='text'>
commit 0b16e69d17d8c35c5c9d5918bf596c75a44655d3 upstream.

When exiting to userspace to service an emulated MMIO write, copy the
to-be-written value to a scratch field in the MMIO fragment if the size
of the data payload is 8 bytes or less, i.e. can fit in a single chunk,
instead of pointing the fragment directly at the source value.
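
As a rough sketch of the idea (the field and helper names here are
illustrative, not necessarily the exact upstream code):

  struct kvm_mmio_fragment {
          gpa_t gpa;
          void *data;
          unsigned len;
          u64 scratch;    /* holds write payloads of 8 bytes or less */
  };

  static void kvm_mmio_frag_fill(struct kvm_mmio_fragment *frag,
                                 void *val, bool write)
  {
          if (write &amp;&amp; frag-&gt;len &lt;= sizeof(frag-&gt;scratch)) {
                  /* Decouple the exit from the (possibly on-stack) source. */
                  memcpy(&amp;frag-&gt;scratch, val, frag-&gt;len);
                  frag-&gt;data = &amp;frag-&gt;scratch;
          } else {
                  frag-&gt;data = val;
          }
  }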

This fixes a class of use-after-free bugs that occur when the emulator
initiates a write using an on-stack, local variable as the source, the
write splits a page boundary, *and* both pages are MMIO pages.  Because
KVM's ABI only allows for physically contiguous MMIO requests, accesses
that split MMIO pages are separated into two fragments, and are sent to
userspace one at a time.  When KVM attempts to complete userspace MMIO in
response to KVM_RUN after the first fragment, KVM will detect the second
fragment and generate a second userspace exit, and reference the on-stack
variable.

The issue is most visible if the second KVM_RUN is performed by a separate
task, in which case the stack of the initiating task can show up as truly
freed data.

  ==================================================================
  BUG: KASAN: use-after-free in complete_emulated_mmio+0x305/0x420
  Read of size 1 at addr ffff888009c378d1 by task syz-executor417/984

  CPU: 1 PID: 984 Comm: syz-executor417 Not tainted 5.10.0-182.0.0.95.h2627.eulerosv2r13.x86_64 #3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  Call Trace:
  dump_stack+0xbe/0xfd
  print_address_description.constprop.0+0x19/0x170
  __kasan_report.cold+0x6c/0x84
  kasan_report+0x3a/0x50
  check_memory_region+0xfd/0x1f0
  memcpy+0x20/0x60
  complete_emulated_mmio+0x305/0x420
  kvm_arch_vcpu_ioctl_run+0x63f/0x6d0
  kvm_vcpu_ioctl+0x413/0xb20
  __se_sys_ioctl+0x111/0x160
  do_syscall_64+0x30/0x40
  entry_SYSCALL_64_after_hwframe+0x67/0xd1
  RIP: 0033:0x42477d
  Code: &lt;48&gt; 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007faa8e6890e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00000000004d7338 RCX: 000000000042477d
  RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005
  RBP: 00000000004d7330 R08: 00007fff28d546df R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d733c
  R13: 0000000000000000 R14: 000000000040a200 R15: 00007fff28d54720

  The buggy address belongs to the page:
  page:0000000029f6a428 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9c37
  flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
  raw: 000fffffc0000000 0000000000000000 ffffea0000270dc8 0000000000000000
  raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
  ffff888009c37780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff888009c37800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  &gt;ffff888009c37880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                   ^
  ffff888009c37900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff888009c37980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ==================================================================

The bug can also be reproduced with a targeted KVM-Unit-Test by hacking
KVM to fill a large on-stack variable in complete_emulated_mmio(), i.e. by
overwriting the data value with garbage.

Limit the use of the scratch fields to 8-byte or smaller accesses, and to
just writes, as larger accesses and reads are not affected thanks to
implementation details in the emulator, but add a sanity check to ensure
those details don't change in the future.  Specifically, KVM never uses
on-stack variables for accesses larger than 8 bytes, e.g. uses an operand
in the emulator context, and *all* reads are buffered through the mem_read
cache.

Note!  Using the scratch field for reads is not only unnecessary, it's
also extremely difficult to handle correctly.  As above, KVM buffers all
reads through the mem_read cache, and heavily relies on that behavior when
re-emulating the instruction after a userspace MMIO read exit.  If a read
splits a page, the first page is NOT an MMIO page, and the second page IS
an MMIO page, then the MMIO fragment needs to point at _just_ the second
chunk of the destination, i.e. its position in the mem_read cache.  Taking
the "obvious" approach of copying the fragment value into the destination
when re-emulating the instruction would clobber the first chunk of the
destination, i.e. would clobber the data that was read from guest memory.

Fixes: f78146b0f923 ("KVM: Fix page-crossing MMIO")
Suggested-by: Yashu Zhang &lt;zhangjiaji1@huawei.com&gt;
Reported-by: Yashu Zhang &lt;zhangjiaji1@huawei.com&gt;
Closes: https://lore.kernel.org/all/369eaaa2b3c1425c85e8477066391bc7@huawei.com
Cc: stable@vger.kernel.org
Tested-by: Tom Lendacky &lt;thomas.lendacky@gmail.com&gt;
Tested-by: Rick Edgecombe &lt;rick.p.edgecombe@intel.com&gt;
Link: https://patch.msgid.link/20260225012049.920665-2-seanjc@google.com
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>x86: rename and clean up __copy_from_user_inatomic_nocache()</title>
<updated>2026-04-22T11:32:21+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-03-30T20:11:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=599201085d9d3d0de57acd6eb1be2481ac9727a5'/>
<id>599201085d9d3d0de57acd6eb1be2481ac9727a5</id>
<content type='text'>
commit 5de7bcaadf160c1716b20a263cf8f5b06f658959 upstream.

Similarly to the previous commit, this renames the somewhat confusingly
named function.  But in this case, it was at least less confusing: the
__copy_from_user_inatomic_nocache is indeed copying from user memory,
and it is indeed ok to be used in an atomic context, so it will not warn
about it.

But the previous commit also removed the NTB mis-use of the
__copy_from_user_inatomic_nocache() function, and as a result every
call-site is now _actually_ doing a real user copy.  That means that we
can now do the proper user pointer verification too.

End result: add proper address checking, remove the double underscores,
and change the "nocache" to "nontemporal" to more accurately describe
what this x86-only function actually does.  It might be worth noting
that only the target is non-temporal: the actual user accesses are
normal memory accesses.
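
In code, the end result has roughly the following shape (the name is
derived from the description above; the exact upstream signature may
differ):

  static inline unsigned long
  copy_from_user_inatomic_nontemporal(void *to, const void __user *from,
                                      unsigned long n)
  {
          if (!access_ok(from, n))
                  return n;       /* report everything as uncopied */
          /* ... followed by the existing non-temporal copy, as before. */
          return __copy_from_user_inatomic_nocache(to, from, n);
  }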

Also worth noting is that on non-x86 targets (and on older 32-bit x86 CPUs
before XMM2 in the Pentium III) we end up just falling back on a regular
user copy, so nothing can actually depend on the non-temporal semantics,
but that has always been true.

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>KVM: SEV: Disallow LAUNCH_FINISH if vCPUs are actively being created</title>
<updated>2026-04-22T11:32:20+00:00</updated>
<author>
<name>Sean Christopherson</name>
<email>seanjc@google.com</email>
</author>
<published>2026-03-10T23:48:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=fbbd7398a05f2103044af368a5d10319a55bab5f'/>
<id>fbbd7398a05f2103044af368a5d10319a55bab5f</id>
<content type='text'>
commit 624bf3440d7214b62c22d698a0a294323f331d5d upstream.

Reject LAUNCH_FINISH for SEV-ES and SNP VMs if KVM is actively creating
one or more vCPUs, as KVM needs to process and encrypt each vCPU's VMSA.
Letting userspace create vCPUs while LAUNCH_FINISH is in progress is
"fine", at least in the current code base, as kvm_for_each_vcpu() operates
on online_vcpus, LAUNCH_FINISH (all SEV+ sub-ioctls) holds kvm-&gt;mutex, and
fully onlining a vCPU in kvm_vm_ioctl_create_vcpu() is done under
kvm-&gt;mutex.  I.e. there's no difference between an in-progress vCPU and a
vCPU that is created entirely after LAUNCH_FINISH.

However, given that concurrent LAUNCH_FINISH and vCPU creation can't
possibly work (for any reasonable definition of "work"), since userspace
can't guarantee whether a particular vCPU will be encrypted or not,
disallow the combination as a hardening measure, to reduce the probability
of introducing bugs in the future, and to avoid having to reason about the
safety of future changes related to LAUNCH_FINISH.
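
A minimal sketch of such a guard, assuming it runs under the VM lock
like the rest of the SEV sub-ioctls (the actual patch may be structured
differently):

  static int sev_reject_if_vcpus_in_flight(struct kvm *kvm)
  {
          /* In-flight creations have bumped created_vcpus, not online_vcpus. */
          if (kvm-&gt;created_vcpus != atomic_read(&amp;kvm-&gt;online_vcpus))
                  return -EBUSY;
          return 0;
  }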

Cc: Jethro Beekman &lt;jethro@fortanix.com&gt;
Closes: https://lore.kernel.org/all/b31f7c6e-2807-4662-bcdd-eea2c1e132fa@fortanix.com
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260310234829.2608037-5-seanjc@google.com
Signed-off-by: Sean Christopherson &lt;seanjc@google.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2026-04-12T17:01:55+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-12T17:01:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ab3dee26406be0ed0a26af70311dcdc760db3996'/>
<id>ab3dee26406be0ed0a26af70311dcdc760db3996</id>
<content type='text'>
Pull timer fixes from Thomas Gleixner:
 "Two fixes for the time/timers subsystem:

   - Invert the inverted fastpath decision in check_tick_dependency(),
     which prevents NOHZ full from stopping the tick. That's a regression
     introduced in the 7.0 merge window.

   - Prevent an unprivileged DoS in the clockevents code, where user
     space can starve the timer interrupt by arming a timerfd or posix
     interval timer in a tight loop with an absolute expiry time in the
     past. The fix turned out to be incomplete and was amended
     yesterday to make it work on some 20-year-old AMD machines as
     well. All issues with it have been confirmed to be resolved by
     various reporters"

* tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: Prevent timer interrupt starvation
  tick/nohz: Fix inverted return value in check_tick_dependency() fast path
</content>
</entry>
<entry>
<title>Merge tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux</title>
<updated>2026-04-11T00:27:08+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-11T00:27:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e774d5f1bc27a85f858bce7688509e866f8e8a4e'/>
<id>e774d5f1bc27a85f858bce7688509e866f8e8a4e</id>
<content type='text'>
Pull RISC-V updates from Paul Walmsley:
 "Before v7.0 is released, fix a few issues with the CFI patchset,
  merged earlier in v7.0-rc, that primarily affect interfaces to
  non-kernel code:

   - Improve the prctl() interface for per-task indirect branch landing
     pad control to expand abbreviations and to resemble the speculation
     control prctl() interface

   - Expand the "LP" and "SS" abbreviations in the ptrace uapi header
     file to "branch landing pad" and "shadow stack", to improve
     readability

   - Fix a typo in a CFI-related macro name in the ptrace uapi header
     file

   - Ensure that the indirect branch tracking state and shadow stack
     state are unlocked immediately after an exec() on the new task so
     that libc can subsequently control them

   - While working in this area, clean up the kernel-internal,
     cross-architecture prctl() function names by expanding the
     abbreviations mentioned above"

* tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  prctl: cfi: change the branch landing pad prctl()s to be more descriptive
  riscv: ptrace: cfi: expand "SS" references to "shadow stack" in uapi headers
  prctl: rename branch landing pad implementation functions to be more explicit
  riscv: ptrace: expand "LP" references to "branch landing pads" in uapi headers
  riscv: cfi: clear CFI lock status in start_thread()
  riscv: ptrace: cfi: fix "PRACE" typo in uapi header
</content>
</entry>
<entry>
<title>clockevents: Prevent timer interrupt starvation</title>
<updated>2026-04-10T20:45:38+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@kernel.org</email>
</author>
<published>2026-04-07T08:54:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d6e152d905bdb1f32f9d99775e2f453350399a6a'/>
<id>d6e152d905bdb1f32f9d99775e2f453350399a6a</id>
<content type='text'>
Calvin reported an odd NMI watchdog lockup which claims that the CPU locked
up in user space. He provided a reproducer, which sets up a timerfd based
timer and then rearms it in a loop with an absolute expiry time of 1ns.
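
The reproducer boils down to something like the following (a hedged
reconstruction from the description above, not Calvin's actual code):

  #include &lt;time.h&gt;
  #include &lt;sys/timerfd.h&gt;

  int main(void)
  {
          int fd = timerfd_create(CLOCK_MONOTONIC, 0);
          struct itimerspec its = {
                  /* Absolute expiry of 1ns, i.e. always in the past. */
                  .it_value = { .tv_sec = 0, .tv_nsec = 1 },
          };

          for (;;)
                  timerfd_settime(fd, TFD_TIMER_ABSTIME, &amp;its, NULL);
  }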

As the expiry time is in the past, the timer ends up as the first expiring
timer in the per CPU hrtimer base and the clockevent device is programmed
with the minimum delta value. If the machine is fast enough, this ends up
in a endless loop of programming the delta value to the minimum value
defined by the clock event device, before the timer interrupt can fire,
which starves the interrupt and consequently triggers the lockup detector
because the hrtimer callback of the lockup mechanism is never invoked.

As a first step to prevent this, avoid reprogramming the clock event device
when:
     - a forced minimum delta event is pending
     - the new expiry delta is less than or equal to the minimum delta
       (a rough sketch follows below)
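
Roughly (the pending-event flag shown here is an assumption for
illustration; the actual patch may track that state differently):

  static bool clockevents_can_skip_reprogram(struct clock_event_device *dev,
                                             int64_t delta)
  {
          /* The pending min-delta event fires no later than the new request. */
          return dev-&gt;min_delta_event_pending &amp;&amp;
                 delta &lt;= (int64_t)dev-&gt;min_delta_ns;
  }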

Thanks to Calvin for providing the reproducer and to Borislav for testing
and providing data from his Zen5 machine.

The problem is not limited to Zen5; depending on the underlying clock
event device (e.g. the TSC deadline timer on Intel) and the CPU speed,
it is not necessarily observable.

This change serves only as the last resort and further changes will be made
to prevent this scenario earlier in the call chain as far as possible.

[ tglx: Updated to restore the old behaviour vs. !force and delta &lt;= 0 and
  	fixed up the tick-broadcast handlers as pointed out by Borislav ]

Fixes: d316c57ff6bf ("[PATCH] clockevents: add core functionality")
Reported-by: Calvin Owens &lt;calvin@wbinvd.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@kernel.org&gt;
Tested-by: Calvin Owens &lt;calvin@wbinvd.org&gt;
Tested-by: Borislav Petkov &lt;bp@alien8.de&gt;
Link: https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@mozart.vkv.me/
Link: https://patch.msgid.link/20260407083247.562657657@kernel.org
</content>
</entry>
<entry>
<title>Merge tag 'vfs-7.0-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs</title>
<updated>2026-04-10T15:40:49+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-10T15:40:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7c6c4ed80b874f721bc7c2c937e098c56e37d2f0'/>
<id>7c6c4ed80b874f721bc7c2c937e098c56e37d2f0</id>
<content type='text'>
Pull vfs fixes from Christian Brauner:
 "The kernfs rbtree is keyed by (hash, ns, name) where the hash
  is seeded with the raw namespace pointer via init_name_hash(ns).

  The resulting hash values are exposed to userspace through
  readdir seek positions, and the pointer-based ordering in
  kernfs_name_compare() is observable through entry order.

  Switch from raw pointers to ns_common::ns_id for both hashing
  and comparison.

  A preparatory commit first replaces all const void * namespace
  parameters with const struct ns_common * throughout kernfs, sysfs,
  and kobject so the code can access ns-&gt;ns_id. Also compare the
  ns_id when hashes match in the rbtree to handle crafted collisions.

  Also fix an eventpoll RCU grace period issue and a cachefiles refcount
  problem"

* tag 'vfs-7.0-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  kernfs: make directory seek namespace-aware
  kernfs: use namespace id instead of pointer for hashing and comparison
  kernfs: pass struct ns_common instead of const void * for namespace tags
  eventpoll: defer struct eventpoll free to RCU grace period
  cachefiles: fix incorrect dentry refcount in cachefiles_cull()
</content>
</entry>
<entry>
<title>Merge tag 'pmdomain-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm</title>
<updated>2026-04-09T18:09:12+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-09T18:09:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d58305b2dbe3434c9b21ede210329b97c44ee9e8'/>
<id>d58305b2dbe3434c9b21ede210329b97c44ee9e8</id>
<content type='text'>
Pull pmdomain fixes from Ulf Hansson:

 - imx: Prevent hang at power down for imx8mp-blk-ctrl

 - thead: Fix buffer overflow for TH1520 AON driver

 - Change Ulf Hansson's email

* tag 'pmdomain-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
  MAINTAINERS, mailmap: Change Ulf Hansson's email
  pmdomain: imx8mp-blk-ctrl: Keep the NOC_HDCP clock enabled
  firmware: thead: Fix buffer overflow and use standard endian macros
</content>
</entry>
<entry>
<title>kernfs: pass struct ns_common instead of const void * for namespace tags</title>
<updated>2026-04-09T12:36:52+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2026-04-01T10:15:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e3b2cf6e5dba416a03152f299d99982dfe1e861d'/>
<id>e3b2cf6e5dba416a03152f299d99982dfe1e861d</id>
<content type='text'>
kernfs has historically used const void * to pass around namespace tags
used for directory-level namespace filtering. The only current user of
this is sysfs network namespace tagging where struct net pointers are
cast to void *.

Replace all const void * namespace parameters with const struct
ns_common * throughout the kernfs, sysfs, and kobject namespace layers.
This includes the kobj_ns_type_operations callbacks, kobject_namespace(),
and all sysfs/kernfs APIs that accept or return namespace tags.

Passing struct ns_common is needed because various codepaths require
access to the underlying namespace. A struct ns_common can always be
converted back to the concrete namespace type (e.g., struct net) via
container_of() or to_ns_common() in the reverse direction.
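
For example (assuming, as upstream, that struct net embeds its
ns_common as the "ns" member):

  static inline struct net *net_from_ns_common(struct ns_common *ns)
  {
          return container_of(ns, struct net, ns);
  }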

This is a preparatory change for switching to ns_id-based directory
iteration to prevent a KASLR pointer leak through the current use of
raw namespace pointers as hash seeds and comparison keys.
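
The follow-up change can then seed the name hash with the stable id
rather than the raw pointer, roughly (a sketch, not the exact upstream
helper):

  static unsigned int kernfs_name_hash(const char *name,
                                       const struct ns_common *ns)
  {
          unsigned long hash = init_name_hash(ns ? ns-&gt;ns_id : 0);

          while (*name)
                  hash = partial_name_hash(*name++, hash);
          return end_name_hash(hash);
  }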

Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
</entry>
</feed>
