linux.git/mm/vmalloc.c, branch v7.2-rc1

mm/vmalloc: free unused pages on vrealloc() shrink

2026-06-02T22:22:32+00:00

When vrealloc() shrinks an allocation and the new size crosses a page
boundary, unmap and free the tail pages that are no longer needed.  This
reclaims physical memory that was previously wasted for the lifetime of
the allocation.

The heuristic is simple: always free when at least one full page becomes
unused.  Huge page allocations (page_order > 0) are skipped, as partial
freeing would require splitting.  Allocations with VM_FLUSH_RESET_PERMS
are also skipped, as their direct-map permissions must be reset before
pages are returned to the page allocator, which is handled by
vm_reset_perms() during vfree().

Additionally, allocations with VM_USERMAP are skipped because
remap_vmalloc_range_partial() validates mapping requests against the
unchanged vm->size; freeing tail pages would cause vmalloc_to_page() to
return NULL for the unmapped range.

To protect concurrent readers, the shrink path uses Node lock to
synchronize before freeing the pages.

Finally, we notify kmemleak of the reduced allocation size using
kmemleak_free_part() to prevent the kmemleak scanner from faulting on the
newly unmapped virtual addresses.

The virtual address reservation (vm->size / vmap_area) is intentionally
kept unchanged, preserving the address for potential future grow-in-place
support.

Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-4-70b96ee3e9c9@zohomail.in
Signed-off-by: Shivam Kalra 
Suggested-by: Danilo Krummrich 
Reviewed-by: Uladzislau Rezki (Sony) 
Cc: Alice Ryhl 
Signed-off-by: Andrew Morton

mm/vmalloc: use physical page count in vread_iter() for VM_ALLOC areas

2026-06-02T22:22:32+00:00

For VM_ALLOC areas in vread_iter(), derive the vm area size from
vm->nr_pages rather than get_vm_area_size().

Only VM_ALLOC areas are subject to vrealloc() shrinking, which frees pages
without reducing the virtual reservation size.  Switch to using
vm->nr_pages for VM_ALLOC areas so the reader remains correct once shrink
support is added.  Other mapping types (vmap, ioremap) do not initialize
nr_pages and will continue using get_vm_area_size().

[shivamkalra98@zohomail.in: add an nr_pages check]
  Link: https://lore.kernel.org/aff47da5-4fd5-481d-be18-e1eb99639490@zohomail.in
Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-3-70b96ee3e9c9@zohomail.in
Signed-off-by: Shivam Kalra 
Reviewed-by: Uladzislau Rezki (Sony) 
Cc: Alice Ryhl 
Cc: Danilo Krummrich 
Signed-off-by: Andrew Morton

mm/vmalloc: use physical page count for vrealloc() grow-in-place check

2026-06-02T22:22:32+00:00

Update the grow-in-place check in vrealloc() to compare the requested size
against the actual physical page count (vm->nr_pages) rather than the
virtual area size (alloced_size, derived from get_vm_area_size()).

Currently both values are equivalent, but the upcoming vrealloc() shrink
functionality will free pages without reducing the virtual reservation
size.  After such a shrink, the old alloced_size-based comparison would
incorrectly allow a grow-in-place operation to succeed and attempt to
access freed pages.  Switch to vm->nr_pages now so the check remains
correct once shrink support is added.

Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-2-70b96ee3e9c9@zohomail.in
Signed-off-by: Shivam Kalra 
Reviewed-by: Uladzislau Rezki (Sony) 
Cc: Alice Ryhl 
Cc: Danilo Krummrich 
Signed-off-by: Andrew Morton

mm/vmalloc: extract vm_area_free_pages() helper from vfree()

2026-06-02T22:22:31+00:00

Patch series "mm/vmalloc: free unused pages on vrealloc() shrink", v14.

This series implements the TODO in vrealloc() to unmap and free unused
pages when shrinking across a page boundary.

Problem:
When vrealloc() shrinks an allocation, it updates bookkeeping
(requested_size, KASAN shadow) but does not free the underlying physical
pages. This wastes memory for the lifetime of the allocation.

Solution:
- Patch 1: Extracts a vm_area_free_pages(vm, start_idx, end_idx) helper
  from vfree() that frees a range of pages with memcg and nr_vmalloc_pages
  accounting. Freed page pointers are set to NULL to prevent stale
  references.
- Patch 2: Update the grow-in-place check in vrealloc() to compare the
  requested size against the actual physical page count (vm->nr_pages) 
  rather than the virtual area sizes. This is a prerequisite for shrinking.
- Patch 3: For VM_ALLOC areas in vread_iter(), derive the vm area size 
  from vm->nr_pages rather than get_vm_area_size(), which would 
  overestimate the mapped range after a shrink. Other mapping types
  (vmap, ioremap) don't set nr_pages and keep using get_vm_area_size().
- Patch 4: Uses the helper to free tail pages when vrealloc() shrinks
  across a page boundary. 
- Patch 5: Adds a vrealloc test case to lib/test_vmalloc that exercises
  grow-realloc, shrink-across-boundary, shrink-within-page, and
  grow-in-place paths.

The virtual address reservation is kept intact to preserve the range for
potential future grow-in-place support.  A concrete user is the Rust
binder driver's KVVec::shrink_to [1], which performs explicit vrealloc()
shrinks for memory reclamation.


This patch (of 5):

Extract page freeing and NR_VMALLOC stat accounting from vfree() into a
reusable vm_area_free_pages() helper.  The helper operates on a range
[start_idx, end_idx) of pages from a vm_struct, making it suitable for
both full free (vfree) and partial free (upcoming vrealloc shrink).

Freed page pointers in vm->pages[] are set to NULL to prevent stale
references when the vm_struct outlives the free (as in vrealloc shrink).

Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-0-70b96ee3e9c9@zohomail.in
Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-1-70b96ee3e9c9@zohomail.in
Link: https://lore.kernel.org/all/20260216-binder-shrink-vec-v3-v6-0-ece8e8593e53@zohomail.in/ [1]
Signed-off-by: Shivam Kalra 
Reviewed-by: Uladzislau Rezki (Sony) 
Cc: Alice Ryhl 
Cc: Danilo Krummrich 
Signed-off-by: Andrew Morton

vmalloc: add __GFP_SKIP_KASAN support

2026-05-29T04:05:00+00:00

Patch series "kasan: hw_tags: Disable tagging for stack and page-tables",
v4.

Stacks and page tables are always accessed with the match-all tag, so
assigning a new random tag every time at allocation and setting invalid
tag at deallocation time, just adds overhead without improving the
detection.

With __GFP_SKIP_KASAN the page keeps its poison tag and KASAN_TAG_KERNEL
(match-all tag) is stored in the page flags while keeping the poison tag
in the hardware.  The benefit of it is that 256 tag setting instruction
per 4 kB page aren't needed at allocation and deallocation time.

Thus match-all pointers still work, while non-match tags (other than
poison tag) still fault.

__GFP_SKIP_KASAN only skips for KASAN_HW_TAGS mode, so coverage is
unchanged.

Benchmark:
The benchmark has two modes. In thread mode, the child process forks
and creates N threads. In pgtable mode, the parent maps and faults a
specified memory size and then forks repeatedly with children exiting
immediately.

Thread benchmark:
2000 iterations, 2000 threads:	2.575 s → 2.229 s (~13.4% faster)

The pgtable samples:
- 2048 MB, 2000 iters		19.08 s → 17.62 s (~7.6% faster)


This patch (of 3):

For allocations that will be accessed only with match-all pointers (e.g.,
kernel stacks), setting tags is wasted work.  If the caller already set
__GFP_SKIP_KASAN, skip tag setting of vmalloc pages.

Before this patch, __GFP_SKIP_KASAN wasn't being used with vmalloc APIs. 
So it wasn't being checked.  Now its being checked and acted upon.  Other
KASAN modes are unchanged because __GFP_SKIP_KASAN is ignored for them in
the page allocator, and in vmalloc too we ignore this flag for them.

This is a preparatory patch for optimizing kernel stack allocations.

Link: https://lore.kernel.org/20260429102704.680174-1-dev.jain@arm.com
Link: https://lore.kernel.org/20260429102704.680174-2-dev.jain@arm.com
Signed-off-by: Muhammad Usama Anjum 
Co-developed-by: Ryan Roberts 
Signed-off-by: Ryan Roberts 
Co-developed-by: Dev Jain 
Signed-off-by: Dev Jain 
Reviewed-by: Catalin Marinas 
Cc: Arnd Bergmann 
Cc: Ben Segall 
Cc: David Hildenbrand 
Cc: Dietmar Eggemann 
Cc: Ingo Molnar 
Cc: Juri Lelli 
Cc: Kees Cook 
Cc: K Prateek Nayak 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes 
Cc: Mathieu Desnoyers 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Suren Baghdasaryan 
Cc: "Uladzislau Rezki (Sony)" 
Cc: Valentin Schneider 
Cc: Vincent Guittot 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

vmalloc: optimize vfree with free_pages_bulk()

2026-05-29T04:04:40+00:00

Whenever vmalloc allocates high order pages (e.g.  for a huge mapping) it
must immediately split_page() to order-0 so that it remains compatible
with users that want to access the underlying struct page.  Commit
a06157804399 ("mm/vmalloc: request large order pages from buddy
allocator") recently made it much more likely for vmalloc to allocate high
order pages which are subsequently split to order-0.

Unfortunately this had the side effect of causing performance regressions
for tight vmalloc/vfree loops (e.g.  test_vmalloc.ko benchmarks).  See
Closes: tag.  This happens because the high order pages must be gotten
from the buddy but then because they are split to order-0, when they are
freed they are freed to the order-0 pcp.  Previously allocation was for
order-0 pages so they were recycled from the pcp.

It would be preferable if when vmalloc allocates an (e.g.) order-3 page
that it also frees that order-3 page to the order-3 pcp, then the
regression could be removed.

So let's do exactly that; update stats separately first as coalescing is
hard to do correctly without complexity.  Use free_pages_bulk() which uses
the new __free_contig_range() API to batch-free contiguous ranges of pfns.
This not only removes the regression, but significantly improves
performance of vfree beyond the baseline.

A selection of test_vmalloc benchmarks running on arm64 server class
system.  mm-new is the baseline.  Commit a06157804399 ("mm/vmalloc:
request large order pages from buddy allocator") was added in v6.19-rc1
where we see regressions.  Then with this change performance is much
better.  (>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement):

+-----------------+----------------------------------------------------------+-------------------+--------------------+
| Benchmark       | Result Class                                             |   mm-new          |  this series       |
+=================+==========================================================+===================+====================+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |        1331843.33 |         (I) 67.17% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |         415907.33 |             -5.14% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |         755448.00 |         (I) 53.55% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |        1591331.33 |         (I) 57.26% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |        1594345.67 |         (I) 68.46% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |        1071826.00 |         (I) 79.27% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |        1018385.00 |         (I) 84.17% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |        3970899.67 |         (I) 77.01% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |        3821788.67 |         (I) 89.44% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |        7795968.00 |         (I) 82.67% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |        6530169.67 |        (I) 118.09% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |         626808.33 |             -0.98% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         532145.67 |             -1.68% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         537032.67 |             -0.96% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |        8805069.00 |         (I) 74.58% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |         500824.67 |              4.35% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |        1637554.67 |         (I) 76.99% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |        4556288.67 |         (I) 72.23% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |         107371.00 |             -0.70% |
+-----------------+----------------------------------------------------------+-------------------+--------------------+

Link: https://lore.kernel.org/20260401101634.2868165-3-usama.anjum@arm.com
Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
Signed-off-by: Ryan Roberts 
Co-developed-by: Muhammad Usama Anjum 
Signed-off-by: Muhammad Usama Anjum 
Acked-by: Vlastimil Babka (SUSE) 
Acked-by: Zi Yan 
Acked-by: David Hildenbrand (Arm) 
Reviewed-by: Uladzislau Rezki (Sony) 
Cc: Brendan Jackman 
Cc: David Sterba 
Cc: Johannes Weiner 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Nick Terrell 
Cc: Suren Baghdasaryan 
Cc: Vishal Moola (Oracle) 
Signed-off-by: Andrew Morton

mm/vmalloc: do not trigger BUG() on BH disabled context

2026-05-22T02:06:13+00:00

__get_vm_area_node() currently triggers a BUG() if in_interrupt() returns
true.  However, in_interrupt() also reports true when BH are disabled.

The bridge code can call rhashtable_lookup_insert_fast() with bottom
halves disabled:

__vlan_add()
 -> br_fdb_add_local()
  spin_lock_bh(&br->hash_lock); <-- Disable BH
   -> fdb_add_local()
    -> fdb_create()
     -> rhashtable_lookup_insert_fast()
      -> kvmalloc()
       -> vmalloc()
        -> __get_vm_area_node()
         -> BUG_ON(in_interrupt())
  spin_unlock_bh(&br->hash_lock)

this triggers the BUG() despite the caller not being in NMI or
hard IRQ context.

Replace the in_interrupt() check with in_nmi() || in_hardirq().

Link: https://lore.kernel.org/20260515153009.2296191-1-urezki@gmail.com
Fixes: c6307674ed82 ("mm: kvmalloc: add non-blocking support for vmalloc")
Signed-off-by: Uladzislau Rezki (Sony) 
Cc: Ido Schimmel 
Reported-by: syzbot+8b12fc6e0fb139765b58@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69ff8c7c.050a0220.1036b8.000b.GAE@google.com/
Reviewed-by: Baoquan He 
Cc: 
Signed-off-by: Andrew Morton

vmalloc: fix buffer overflow in vrealloc_node_align()

2026-04-27T12:54:23+00:00

Commit 4c5d3365882d ("mm/vmalloc: allow to set node and align in
vrealloc") added the ability to force a new allocation if the current
pointer is on the wrong NUMA node, or if an alignment constraint is not
met, even if the user is shrinking the allocation.

On this path (need_realloc), the code allocates a new object of 'size'
bytes and then memcpy()s 'old_size' bytes into it.  If the request is to
shrink the object (size < old_size), this results in an out-of-bounds
write on the new buffer.

Fix this by bounding the copy length by the new allocation size.

Link: https://lore.kernel.org/20260420114805.3572606-2-elver@google.com
Fixes: 4c5d3365882d ("mm/vmalloc: allow to set node and align in vrealloc")
Signed-off-by: Marco Elver 
Reported-by: Harry Yoo (Oracle) 
Reviewed-by: Uladzislau Rezki (Sony) 
Acked-by: Vlastimil Babka (SUSE) 
Reviewed-by: Harry Yoo (Oracle) 
Cc: 
Signed-off-by: Andrew Morton

Merge tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2026-04-19T21:45:37+00:00

Pull MM fixes from Andrew Morton:
 "7 hotfixes. 6 are cc:stable and all are for MM. Please see the
  individual changelogs for details"

* tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/damon/core: disallow non-power of two min_region_sz on damon_start()
  mm/vmalloc: take vmap_purge_lock in shrinker
  mm: call ->free_folio() directly in folio_unmap_invalidate()
  mm: blk-cgroup: fix use-after-free in cgwb_release_workfn()
  mm/zone_device: do not touch device folio after calling ->folio_free()
  mm/damon/core: disallow time-quota setting zero esz
  mm/mempolicy: fix weighted interleave auto sysfs name

mm/vmalloc: take vmap_purge_lock in shrinker

2026-04-19T06:24:27+00:00

decay_va_pool_node() can be invoked concurrently from two paths:
__purge_vmap_area_lazy() when pools are being purged, and the shrinker via
vmap_node_shrink_scan().

However, decay_va_pool_node() is not safe to run concurrently, and the
shrinker path currently lacks serialization, leading to races and possible
leaks.

Protect decay_va_pool_node() by taking vmap_purge_lock in the shrinker
path to ensure serialization with purge users.

Link: https://lore.kernel.org/20260413192646.14683-1-urezki@gmail.com
Fixes: 7679ba6b36db ("mm: vmalloc: add a shrinker to drain vmap pools")
Signed-off-by: Uladzislau Rezki (Sony) 
Reviewed-by: Baoquan He 
Cc: chenyichong 
Cc: 
Signed-off-by: Andrew Morton