linux-stable.git/mm, branch v6.12.89

mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock

2026-05-14T13:29:25+00:00

commit 1e68eb96e8beb1abefd12dd22c5637795d8a877e upstream.

Patch series "mm/damon/sysfs-schemes: fix use-after-free for [memcg_]path".

Reads of 'memcg_path' and 'path' files in DAMON sysfs interface could race
with their writes, resulting in use-after-free.  Fix those.


This patch (of 2):

damon_sysfs_scheme_filter->memcg_path can be read and written by users
via the DAMON sysfs memcg_path file.  It can also be read indirectly, when
parameters are committed to DAMON {on,off}line.  The reads for parameter
committing are protected by damon_sysfs_lock to avoid the sysfs files
being destroyed while any of the parameters are being read.  But the
user-driven direct reads and writes are not protected by any lock, while
the write deallocates the buffer that memcg_path points to.  As a result,
readers could read the already freed buffer (use-after-free).  Note
that the user reads don't race when the same open file is used by the
writer, due to kernfs's open file locking.  Nonetheless, doing the reads
and writes with separate open files would be common.  Fix it by protecting
both the user-direct reads and writes with damon_sysfs_lock.
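
The locking pattern described above can be illustrated with a minimal
userspace sketch.  This is not the actual patch: sketch_lock stands in for
damon_sysfs_lock, struct sketch_filter for damon_sysfs_scheme_filter, and
the two helper names are hypothetical.  A pthread mutex replaces the kernel
mutex.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for damon_sysfs_lock. */
static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for damon_sysfs_scheme_filter. */
struct sketch_filter {
	char *memcg_path;	/* heap buffer, replaced on every write */
};

/* Write side: swap in the new buffer and free the old one under the lock. */
static int sketch_path_store(struct sketch_filter *f, const char *buf)
{
	size_t len = strlen(buf);
	char *path = malloc(len + 1);

	if (!path)
		return -1;
	memcpy(path, buf, len + 1);

	pthread_mutex_lock(&sketch_lock);
	free(f->memcg_path);
	f->memcg_path = path;
	pthread_mutex_unlock(&sketch_lock);
	return 0;
}

/*
 * Read side: copy the string out while holding the same lock, so the
 * buffer cannot be freed by a concurrent writer in the middle of the read.
 */
static int sketch_path_show(struct sketch_filter *f, char *out, size_t len)
{
	int n;

	pthread_mutex_lock(&sketch_lock);
	n = snprintf(out, len, "%s\n", f->memcg_path ? f->memcg_path : "");
	pthread_mutex_unlock(&sketch_lock);
	return n;
}
```

Without the lock in sketch_path_show(), a reader could dereference
f->memcg_path right after a concurrent sketch_path_store() freed it.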

Link: https://lore.kernel.org/20260423150253.111520-1-sj@kernel.org
Link: https://lore.kernel.org/20260423150253.111520-2-sj@kernel.org
Fixes: 4f489fe6afb3 ("mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write")
Co-developed-by: Junxi Qian 
Signed-off-by: Junxi Qian 
Signed-off-by: SeongJae Park 
Cc:  # 6.16.x
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

mm: convert mm_lock_seq to a proper seqcount

2026-05-14T13:29:17+00:00

[ Upstream commit eb449bd96954b1c1e491d19066cfd2a010f0aa47 ]

Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock
variants to increment it, in-line with the usual seqcount usage pattern.
This lets us check whether the mmap_lock is write-locked by checking
mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be
used when implementing mmap_lock speculation functions.
As a result vm_lock_seq is also changed to be unsigned to match the type
of mm_lock_seq.sequence.
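
The odd/even convention can be sketched with a toy counter.  This is a
single-threaded illustration only: struct sketch_seqcount and its helpers
are hypothetical names, not the kernel's seqcount_t API, and the memory
barriers a real seqcount needs are omitted.

```c
#include <stdbool.h>

/* Hypothetical minimal sequence counter; not the kernel's seqcount_t. */
struct sketch_seqcount {
	unsigned int sequence;
};

/* Taking the write lock bumps the counter to an odd value. */
static void sketch_write_begin(struct sketch_seqcount *s)
{
	s->sequence++;
}

/* Dropping the write lock bumps it again, back to an even value. */
static void sketch_write_end(struct sketch_seqcount *s)
{
	s->sequence++;
}

/* odd = write-locked, even = unlocked */
static bool sketch_is_write_locked(const struct sketch_seqcount *s)
{
	return s->sequence & 1;
}

/* Speculation may only start while no writer holds the lock. */
static bool sketch_read_begin(const struct sketch_seqcount *s,
			      unsigned int *seq)
{
	*seq = s->sequence;
	return !(*seq & 1);
}

/* The speculative result must be discarded if a writer ran in between. */
static bool sketch_read_retry(const struct sketch_seqcount *s,
			      unsigned int seq)
{
	return s->sequence != seq;
}
```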

Suggested-by: Peter Zijlstra 
Signed-off-by: Suren Baghdasaryan 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Liam R. Howlett 
Link: https://lkml.kernel.org/r/20241122174416.1367052-2-surenb@google.com
Stable-dep-of: 52f657e34d7b ("x86: shadow stacks: proper error handling for mmap lock")
Signed-off-by: Sasha Levin

mm: prevent droppable mappings from being locked

2026-05-07T04:09:47+00:00

[ Upstream commit d239462787b072c78eb19fc1f155c3d411256282 ]

Droppable mappings must not be lockable.  There is a check for VMAs with
VM_DROPPABLE set in mlock_fixup() along with checks for other types of
unlockable VMAs which ensures this when calling mlock()/mlock2().

For mlockall(MCL_FUTURE), the check for unlockable VMAs is different.  In
apply_mlockall_flags(), if the flags parameter has MCL_FUTURE set, the
current task's mm's default VMA flag field mm->def_flags has VM_LOCKED
applied to it.  VM_LOCKONFAULT is also applied if MCL_ONFAULT is also set.
When these flags are set as default in this manner they are cleared in
__mmap_complete() for new mappings that do not support mlock.  A check for
VM_DROPPABLE in __mmap_complete() is missing, resulting in droppable
mappings being created with VM_LOCKED set.  To fix this and reduce the
chance of similar bugs in the future, introduce and use vma_supports_mlock().
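
The idea of a single predicate shared by all mlock paths can be sketched
as follows.  The flag values and the sketch_* names are hypothetical and
the real vma_supports_mlock() checks more VMA types than shown here; the
sketch only shows why centralizing the test closes the MCL_FUTURE gap.

```c
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
#define SK_VM_LOCKED		0x1UL
#define SK_VM_LOCKONFAULT	0x2UL
#define SK_VM_SPECIAL		0x4UL	/* stands in for other unlockable types */
#define SK_VM_DROPPABLE		0x8UL

/* One predicate listing every mapping type that must not be locked. */
static bool sketch_supports_mlock(unsigned long vm_flags)
{
	return !(vm_flags & (SK_VM_SPECIAL | SK_VM_DROPPABLE));
}

/*
 * Mapping creation: default flags set by mlockall(MCL_FUTURE) are
 * inherited, then cleared again if the mapping cannot be locked.
 */
static unsigned long sketch_finish_mapping(unsigned long vm_flags,
					   unsigned long def_flags)
{
	vm_flags |= def_flags;
	if (!sketch_supports_mlock(vm_flags))
		vm_flags &= ~(SK_VM_LOCKED | SK_VM_LOCKONFAULT);
	return vm_flags;
}
```

With this shape, a droppable mapping created after mlockall(MCL_FUTURE)
ends up without the lock flags, while an ordinary mapping keeps them.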

Link: https://lkml.kernel.org/r/20260310155821.17869-1-anthony.yznaga@oracle.com
Fixes: 9651fcedf7b9 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings")
Signed-off-by: Anthony Yznaga 
Suggested-by: David Hildenbrand 
Acked-by: David Hildenbrand (Arm) 
Reviewed-by: Pedro Falcato 
Reviewed-by: Lorenzo Stoakes (Oracle) 
Tested-by: Lorenzo Stoakes (Oracle) 
Cc: Jann Horn 
Cc: Jason A. Donenfeld 
Cc: Liam Howlett 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Shuah Khan 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
[ adapted change to `mm/mmap.c::__mmap_region()` instead of `mm/vma.c::__mmap_complete()` ]
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm/zsmalloc: copy KMSAN metadata in zs_page_migrate()

2026-05-07T04:09:45+00:00

[ Upstream commit 4fb61d95ad21c3b6f1c09f357ff49d70abb0535e ]

zs_page_migrate() uses copy_page() to copy the contents of a zspage page
during migration.  However, copy_page() is not instrumented by KMSAN, so
the shadow and origin metadata of the destination page are not updated.

As a result, subsequent accesses to the migrated page are reported as
use-after-free by KMSAN, despite the data being correctly copied.

Add a kmsan_copy_page_meta() call after copy_page() to propagate the KMSAN
metadata to the new page, matching what copy_highpage() does internally.

Link: https://lkml.kernel.org/r/20260321132912.93434-1-syoshida@redhat.com
Fixes: afb2d666d025 ("zsmalloc: use copy_page for full page copy")
Signed-off-by: Shigeru Yoshida 
Reviewed-by: Sergey Senozhatsky 
Cc: Mark-PK Tsai 
Cc: Minchan Kim 
Signed-off-by: Andrew Morton 
[ translated zpdesc_page(newzpdesc/zpdesc) arguments to newpage/page ]
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm: migrate: requeue destination folio on deferred split queue

2026-05-07T04:09:44+00:00

[ Upstream commit a2e0c0668a3486f96b86c50e02872c8e94fd4f9c ]

During folio migration, __folio_migrate_mapping() removes the source folio
from the deferred split queue, but the destination folio is never
re-queued.  This causes underutilized THPs to escape the shrinker after
NUMA migration, since they silently drop off the deferred split list.

Fix this by recording whether the source folio was on the deferred split
queue and its partially mapped state before move_to_new_folio() unqueues
it, and re-queuing the destination folio after a successful migration if
it was.

By the time migrate_folio_move() runs, partially mapped folios without a
pin have already been split by migrate_pages_batch().  So only two cases
remain on the deferred list at this point:
  1. Partially mapped folios with a pin (split failed).
  2. Fully mapped but potentially underused folios.

The recorded partially_mapped state is forwarded to deferred_split_folio()
so that the destination folio is correctly re-queued in both cases.
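
The record-then-requeue flow can be sketched with toy types.  Everything
here is hypothetical (struct sk_folio, sk_move(), sk_deferred_split(),
sk_migrate()); the sketch only models the ordering problem, where the
source's queue state has to be sampled before the move step clears it.

```c
#include <stdbool.h>

/* Hypothetical toy folio carrying only the state relevant here. */
struct sk_folio {
	bool on_deferred_list;
	bool partially_mapped;
};

/* Stands in for deferred_split_folio(): queue with the given state. */
static void sk_deferred_split(struct sk_folio *f, bool partially_mapped)
{
	f->on_deferred_list = true;
	f->partially_mapped = partially_mapped;
}

/* Stands in for the move step, which unqueues the source on success. */
static bool sk_move(struct sk_folio *src, struct sk_folio *dst, bool succeed)
{
	(void)dst;
	if (!succeed)
		return false;
	src->on_deferred_list = false;
	src->partially_mapped = false;
	return true;
}

static bool sk_migrate(struct sk_folio *src, struct sk_folio *dst,
		       bool succeed)
{
	/* Sample the state first: sk_move() destroys it. */
	bool was_queued = src->on_deferred_list;
	bool was_partial = src->partially_mapped;

	if (!sk_move(src, dst, succeed))
		return false;

	/* Re-queue the destination only if the source was queued. */
	if (was_queued)
		sk_deferred_split(dst, was_partial);
	return true;
}
```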

Because THPs are removed from the deferred_list, the THP shrinker cannot
split the underutilized THPs in time.  As a result, users will see
less free memory than before.

Link: https://lkml.kernel.org/r/20260312104723.1351321-1-usama.arif@linux.dev
Fixes: dafff3f4c850 ("mm: split underused THPs")
Signed-off-by: Usama Arif 
Reported-by: Johannes Weiner 
Acked-by: Johannes Weiner 
Acked-by: Zi Yan 
Acked-by: David Hildenbrand (Arm) 
Acked-by: SeongJae Park 
Reviewed-by: Wei Yang 
Cc: Alistair Popple 
Cc: Byungchul Park 
Cc: Gregory Price 
Cc: "Huang, Ying" 
Cc: Joshua Hahn 
Cc: Matthew Brost 
Cc: Matthew Wilcox (Oracle) 
Cc: Nico Pache 
Cc: Rakie Kim 
Cc: Ying Huang 
Signed-off-by: Andrew Morton 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm/migrate: move movable_ops page handling out of move_to_new_folio()

2026-05-07T04:09:44+00:00

[ Upstream commit be4a3e9c185264e9ad0fe02c1c5d81b8386bd50c ]

Let's move that handling directly into migrate_folio_move(), so we can
simplify move_to_new_folio().  While at it, fix up the documentation a bit.

Note that unmap_and_move_huge_page() does not care, because it only deals
with actual folios.  (we only support migration of individual movable_ops
pages)

Link: https://lkml.kernel.org/r/20250704102524.326966-12-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Zi Yan 
Reviewed-by: Harry Yoo 
Reviewed-by: Lorenzo Stoakes 
Cc: Alistair Popple 
Cc: Al Viro 
Cc: Arnd Bergmann 
Cc: Brendan Jackman 
Cc: Byungchul Park 
Cc: Chengming Zhou 
Cc: Christian Brauner 
Cc: Christophe Leroy 
Cc: Eugenio Pérez
Cc: Greg Kroah-Hartman 
Cc: Gregory Price 
Cc: "Huang, Ying" 
Cc: Jan Kara 
Cc: Jason Gunthorpe 
Cc: Jason Wang 
Cc: Jerrin Shaji George 
Cc: Johannes Weiner 
Cc: John Hubbard 
Cc: Jonathan Corbet 
Cc: Joshua Hahn 
Cc: Liam Howlett 
Cc: Madhavan Srinivasan 
Cc: Matthew Brost
Cc: Matthew Wilcox (Oracle) 
Cc: Miaohe Lin 
Cc: Michael Ellerman 
Cc: "Michael S. Tsirkin" 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Minchan Kim 
Cc: Naoya Horiguchi 
Cc: Nicholas Piggin 
Cc: Oscar Salvador 
Cc: Peter Xu 
Cc: Qi Zheng 
Cc: Rakie Kim 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Shakeel Butt 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Cc: Xuan Zhuo 
Cc: xu xin 
Signed-off-by: Andrew Morton 
Stable-dep-of: a2e0c0668a34 ("mm: migrate: requeue destination folio on deferred split queue")
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm/migrate: factor out movable_ops page handling into migrate_movable_ops_page()

2026-05-07T04:09:44+00:00

[ Upstream commit b9ed00483d4cbacca04edb11984d8daf09e9ae22 ]

Let's factor it out, simplifying the calling code.

Before this change, we would have called flush_dcache_folio() also on
movable_ops pages.  As documented in Documentation/core-api/cachetlb.rst:

	"This routine need only be called for page cache pages which can
	 potentially ever be mapped into the address space of a user
	 process."

So don't do it for movable_ops pages.  If there would ever be such a
movable_ops page user, it should do the flushing itself after performing
the copy.

Note that we can now change folio_mapping_flags() to folio_test_anon() to
make it clearer, because movable_ops pages will never take that path.

[akpm@linux-foundation.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/20250704102524.326966-10-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Zi Yan 
Reviewed-by: Lorenzo Stoakes 
Cc: Alistair Popple 
Cc: Al Viro 
Cc: Arnd Bergmann 
Cc: Brendan Jackman 
Cc: Byungchul Park 
Cc: Chengming Zhou 
Cc: Christian Brauner 
Cc: Christophe Leroy 
Cc: Eugenio Pérez
Cc: Greg Kroah-Hartman 
Cc: Gregory Price 
Cc: Harry Yoo 
Cc: "Huang, Ying" 
Cc: Jan Kara 
Cc: Jason Gunthorpe 
Cc: Jason Wang 
Cc: Jerrin Shaji George 
Cc: Johannes Weiner 
Cc: John Hubbard 
Cc: Jonathan Corbet 
Cc: Joshua Hahn 
Cc: Liam Howlett 
Cc: Madhavan Srinivasan 
Cc: Matthew Brost
Cc: Matthew Wilcox (Oracle) 
Cc: Miaohe Lin 
Cc: Michael Ellerman 
Cc: "Michael S. Tsirkin" 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Minchan Kim 
Cc: Naoya Horiguchi 
Cc: Nicholas Piggin 
Cc: Oscar Salvador 
Cc: Peter Xu 
Cc: Qi Zheng 
Cc: Rakie Kim 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Shakeel Butt 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Cc: Xuan Zhuo 
Cc: xu xin 
Signed-off-by: Andrew Morton 
Stable-dep-of: a2e0c0668a34 ("mm: migrate: requeue destination folio on deferred split queue")
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm/damon/core: use time_in_range_open() for damos quota window start

2026-05-07T04:09:36+00:00

commit 049a57421dd67a28c45ae7e92c36df758033e5fa upstream.

damos_adjust_quota() uses time_after_eq() to check if it is time to start a
new quota charge window, comparing the current jiffies and the scheduled
next charge window start time.  If it is, the next charge window start
time is updated and the new charge window starts.

The time check and next window start time update are skipped while the
scheme is deactivated by the watermarks.  Suppose the deactivation lasts
more than LONG_MAX jiffies (assuming CONFIG_HZ of 250, more than 99 days
on 32 bit systems and more than one billion years on 64 bit systems), so
that jiffies becomes larger than the next charge window start time +
LONG_MAX.  Then, the time_after_eq() call can return false until another
LONG_MAX jiffies have passed.

This means the scheme can continue working after being reactivated by the
watermarks.  But, soon, the quota will be exceeded and the scheme will
again effectively stop working until the next charge window starts.
Because the current charge window is extended to up to LONG_MAX jiffies,
however, it will look like it stopped unexpectedly and indefinitely, from
the user's perspective.

Fix this by using !time_in_range_open() instead.
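
The difference between the two checks can be reproduced in userspace.
The helpers below mirror the wrap-safe comparisons from
include/linux/jiffies.h without their type checks.  The sk_* names and
the window bounds passed to them are assumptions made for illustration,
since the commit message does not show the exact arguments used in
damos_adjust_quota().

```c
#include <limits.h>
#include <stdbool.h>

/* Userspace mirrors of the kernel's wrap-safe jiffies comparisons. */
static bool sk_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

static bool sk_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* True if b <= a < c, in wrap-safe terms. */
static bool sk_time_in_range_open(unsigned long a, unsigned long b,
				  unsigned long c)
{
	return sk_time_after_eq(a, b) && sk_time_before(a, c);
}

/* Old-style check: has 'now' reached the end of the charge window? */
static bool sk_window_expired_old(unsigned long now, unsigned long start,
				  unsigned long len)
{
	return sk_time_after_eq(now, start + len);
}

/* New-style check: is 'now' outside [start, start + len)? */
static bool sk_window_expired_new(unsigned long now, unsigned long start,
				  unsigned long len)
{
	return !sk_time_in_range_open(now, start, start + len);
}
```

For a window starting at 1000 with length 250, both checks agree at
now = 1100 (not expired) and now = 1250 (expired).  Once now is more than
LONG_MAX past the window end, the old check reports "not expired" because
the signed difference wraps negative, while the new check still reports
"expired".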

The issue was discovered [1] by sashiko.

Link: https://lore.kernel.org/20260329152306.45796-1-sj@kernel.org
Link: https://lore.kernel.org/20260324040722.57944-1-sj@kernel.org [1]
Fixes: ee801b7dd782 ("mm/damon/schemes: activate schemes based on a watermarks mechanism")
Signed-off-by: SeongJae Park 
Cc:  # 5.16.x
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range()

2026-05-07T04:09:27+00:00

commit 397f6d14f9c370e4910e6885294c340f39dedbf5 upstream.

In do_migrate_range(), the hwpoisoned folio may be a large folio, which
can't be handled by unmap_poisoned_folio().

I can reproduce this issue in qemu after adding a delay in memory_failure():

BUG: kernel NULL pointer dereference, address: 0000000000000000
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:try_to_unmap_one+0x16a/0xfc0
  
  rmap_walk_anon+0xda/0x1f0
  try_to_unmap+0x78/0x80
  ? __pfx_try_to_unmap_one+0x10/0x10
  ? __pfx_folio_not_mapped+0x10/0x10
  ? __pfx_folio_lock_anon_vma_read+0x10/0x10
  unmap_poisoned_folio+0x60/0x140
  do_migrate_range+0x4d1/0x600
  ? slab_memory_callback+0x6a/0x190
  ? notifier_call_chain+0x56/0xb0
  offline_pages+0x3e6/0x460
  memory_subsys_offline+0x130/0x1f0
  device_offline+0xba/0x110
  acpi_bus_offline+0xb7/0x130
  acpi_scan_hot_remove+0x77/0x290
  acpi_device_hotplug+0x1e0/0x240
  acpi_hotplug_work_fn+0x1a/0x30
  process_one_work+0x186/0x340

Besides, do_migrate_range() may be called between memory_failure() setting
the hwpoison flag and isolating the folio from the LRU, so remove the
WARN_ON().  In other places, unmap_poisoned_folio() is called when the folio
is isolated; follow the same rule in do_migrate_range() too.

[david@redhat.com: don't abort offlining, fixed typo, add comment]
Link: https://lkml.kernel.org/r/3c214dff-9649-4015-840f-10de0e03ebe4@redhat.com
Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined")
Signed-off-by: Jinjiang Tu 
Signed-off-by: David Hildenbrand 
Acked-by: Zi Yan 
Reviewed-by: Miaohe Lin 
Cc: Kefeng Wang 
Cc: Luis Chamberlain
Cc: Matthew Wilcox (Oracle) 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pankaj Raghav 
Signed-off-by: Andrew Morton 
Signed-off-by: Alexandra Diupina 
Signed-off-by: Sasha Levin

mm/pagewalk: fix race between concurrent split and refault

2026-04-27T13:24:24+00:00

[ Upstream commit 9b25a6e3d243a8ce14eeaf74082c621a9944c776 ]

The splitting of a PUD entry in walk_pud_range() can race with a
concurrent thread refaulting the PUD leaf entry causing it to try walking
a PMD range that has disappeared.

An example and reproduction of this is to try reading numa_maps of a
process while VFIO-PCI is setting up DMA (specifically the
vfio_pin_pages_remote call) on a large BAR for that process.

This will trigger a kernel BUG:
vfio-pci 0000:03:00.0: enabling device (0000 -> 0002)
BUG: unable to handle page fault for address: ffffa23980000000
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
...
RIP: 0010:walk_pgd_range+0x3b5/0x7a0
Code: 8d 43 ff 48 89 44 24 28 4d 89 ce 4d 8d a7 00 00 20 00 48 8b 4c 24
28 49 81 e4 00 00 e0 ff 49 8d 44 24 ff 48 39 c8 4c 0f 43 e3 <49> f7 06
   9f ff ff ff 75 3b 48 8b 44 24 20 48 8b 40 28 48 85 c0 74
RSP: 0018:ffffac23e1ecf808 EFLAGS: 00010287
RAX: 00007f44c01fffff RBX: 00007f4500000000 RCX: 00007f44ffffffff
RDX: 0000000000000000 RSI: 000ffffffffff000 RDI: ffffffff93378fe0
RBP: ffffac23e1ecf918 R08: 0000000000000004 R09: ffffa23980000000
R10: 0000000000000020 R11: 0000000000000004 R12: 00007f44c0200000
R13: 00007f44c0000000 R14: ffffa23980000000 R15: 00007f44c0000000
FS:  00007fe884739580(0000) GS:ffff9b7d7a9c0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffa23980000000 CR3: 000000c0650e2005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 
 __walk_page_range+0x195/0x1b0
 walk_page_vma+0x62/0xc0
 show_numa_map+0x12b/0x3b0
 seq_read_iter+0x297/0x440
 seq_read+0x11d/0x140
 vfs_read+0xc2/0x340
 ksys_read+0x5f/0xe0
 do_syscall_64+0x68/0x130
 ? get_page_from_freelist+0x5c2/0x17e0
 ? mas_store_prealloc+0x17e/0x360
 ? vma_set_page_prot+0x4c/0xa0
 ? __alloc_pages_noprof+0x14e/0x2d0
 ? __mod_memcg_lruvec_state+0x8d/0x140
 ? __lruvec_stat_mod_folio+0x76/0xb0
 ? __folio_mod_stat+0x26/0x80
 ? do_anonymous_page+0x705/0x900
 ? __handle_mm_fault+0xa8d/0x1000
 ? __count_memcg_events+0x53/0xf0
 ? handle_mm_fault+0xa5/0x360
 ? do_user_addr_fault+0x342/0x640
 ? arch_exit_to_user_mode_prepare.constprop.0+0x16/0xa0
 ? irqentry_exit_to_user_mode+0x24/0x100
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fe88464f47e
Code: c0 e9 b6 fe ff ff 50 48 8d 3d be 07 0b 00 e8 69 01 02 00 66 0f 1f
84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00
   f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
RSP: 002b:00007ffe6cd9a9b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fe88464f47e
RDX: 0000000000020000 RSI: 00007fe884543000 RDI: 0000000000000003
RBP: 00007fe884543000 R08: 00007fe884542010 R09: 0000000000000000
R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
 

Fix this by validating the PUD entry in walk_pmd_range() using a stable
snapshot (pudp_get()).  If the PUD is not present or is a leaf, retry the
walk via ACTION_AGAIN instead of descending further.  This mirrors the
retry logic in walk_pte_range(), which lets walk_pmd_range() retry if
pte_offset_map_lock() fails to get the PTE.
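
The validate-on-snapshot step can be sketched as below.  The types and
names are hypothetical (struct sk_pud, sk_walk_pmd_range(), the SK_*
actions), and a plain struct copy stands in for pudp_get(); the point is
that the decision is made on one local copy of the entry rather than on
repeated reads of memory that another thread may be changing.

```c
#include <stdbool.h>

/* Hypothetical walker actions, modelled on ACTION_AGAIN. */
enum sk_action { SK_CONTINUE, SK_AGAIN };

/* Hypothetical toy PUD entry. */
struct sk_pud {
	bool present;
	bool leaf;
};

/*
 * Before descending to the PMD level, take one snapshot of the PUD entry
 * and validate it.  If a concurrent refault turned it back into a leaf
 * (or it vanished), ask the caller to retry instead of walking a PMD
 * table that no longer exists.
 */
static enum sk_action sk_walk_pmd_range(const struct sk_pud *pudp)
{
	struct sk_pud pud = *pudp;	/* stands in for pudp_get() */

	if (!pud.present || pud.leaf)
		return SK_AGAIN;

	/* Safe to descend: the snapshot points to a PMD table. */
	return SK_CONTINUE;
}
```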

Link: https://lkml.kernel.org/r/20260325-pagewalk-check-pmd-refault-v2-1-707bff33bc60@akamai.com
Fixes: f9e54c3a2f5b ("vfio/pci: implement huge_fault support")
Co-developed-by: David Hildenbrand (Arm) 
Signed-off-by: David Hildenbrand (Arm) 
Signed-off-by: Max Boone 
Acked-by: David Hildenbrand (Arm) 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes (Oracle) 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
[ Context ]
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman