summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
3 daysMerge tag 'block-7.2-20260625' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - blk-cgroup locking rework and fixes: - fix a use-after-free in __blkcg_rstat_flush() - defer freeing policy data until after an RCU grace period - defer the blkcg css_put until the blkg is unlinked from the queue - unwind the queue_lock nesting under RCU / blkcg->lock across the lookup, create, associate and destroy paths - NVMe fixes via Keith: - Fix a crash and memory leak during invalid cdev teardown, and related cdev cleanups (Maurizio, John) - nvmet fixes: handle TCP_CLOSING in the tcp state_change handler, reject short AUTH_RECEIVE buffers, handle inline data with a nonzero offset in rdma, fix an sq refcount leak, and allocate ana_state with the port (Maurizio, Michael, Bryam, Wentao, Rosen) - nvme-fc fix to not cancel requests on an IO target before it is initialized (Mohamed) - nvme-apple fix to prevent shared tags across queues on Apple A11 (Nick) - Various smaller fixes and cleanups (John) - MD fixes via Yu Kuai: - raid1/raid10 fixes for writes_pending and barrier reference leaks on write and discard failures, plus REQ_NOWAIT handling fixes (Abd-Alrhman) - raid5 discard accounting and validation, and a batch of fixes for stripe batch races (Yu Kuai, Chen) - Protect raid1 head_position during read balancing (Chen) - block bio-integrity fixes: correct an error injection static key decrement, fix GFP flag confusion in bio_integrity_alloc_buf(), and handle REQ_OP_ZONE_APPEND in __bio_integrity_action() (Christoph) - Fixes for bio_iov_iter_bounce_write(): revert the iov_iter after a short copy, and respect the iov_iter nofault flag (Qu) - Invalidate the cached plug timestamp after a task switch, and clear PF_BLOCK_TS in copy_process() (Usama) - Fix the IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd() (Yitang) - Remove a redundant plug in __submit_bio() (Wen) - Don't warn when reclassifying a busy socket lock in nbd (Deepanshu) * tag 'block-7.2-20260625' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (45 commits) block: handle REQ_OP_ZONE_APPEND in __bio_integrity_action block: fix GFP_ flags confusion in bio_integrity_alloc_buf block, bfq: don't grab queue_lock to initialize bfq mm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page() blk-cgroup: don't nest queue_lock under blkcg->lock in blkcg_destroy_blkgs() blk-cgroup: don't nest queue_lock under rcu in bio_associate_blkg() blk-cgroup: don't nest queue_lock under rcu in blkg_lookup_create() blk-cgroup: don't nest queue_lock under rcu in blkcg_print_blkgs() blk-cgroup: delay freeing policy data after rcu grace period blk-cgroup: protect iterating blkgs with blkcg->lock in blkcg_print_stat() md/raid5: avoid R5_Overlap races while breaking stripe batches md/raid5: use stripe state snapshot in break_stripe_batch_list() blk-cgroup: defer blkcg css_put until blkg is unlinked from queue blk-cgroup: fix UAF in __blkcg_rstat_flush() block, bfq: protect async queue reset with blkcg locks nbd: don't warn when reclassifying a busy socket lock block: fix incorrect error injection static key decrement md/raid5: let stripe batch bm_seq comparison wrap-safe md/raid1: protect head_position for read balance md/raid1: free r1_bio when REQ_NOWAIT is set and read would block on retry ...
4 daysmm/page_io: don't nest queue_lock under rcu in bio_associate_blkg_from_page()Yu Kuai
Take a css reference under RCU, drop RCU, and then associate the bio with the blkg. This avoids nesting queue_lock under RCU and prepares to protect blkcg with blkcg_mutex instead of queue_lock. Use css_tryget() instead of css_tryget_online() so swap writeback for pages charged to a dying memcg still passes the dying css to bio_associate_blkg_from_css(). That preserves the existing closest-live ancestor fallback instead of charging those bios to the root blkg. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/c910d2c39d3ec97f67de68af636a52394342d55f.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 daysMerge tag 'mm-stable-2026-06-23-08-55' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "khugepaged: add mTHP collapse support" (Nico Pache) Provide khugepaged with the capability to collapse anonymous memory regions to mTHPs - "Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files" (Zi Yan) Remove the READ_ONLY_THP_FOR_FS check in file_thp_enabled(), so that khugepaged and MADV_COLLAPSE can run on filesystems with PMD THP pagecache support even without READ_ONLY_THP_FOR_FS enabled - "make MM selftests more CI friendly" (Mike Rapoport) General fixes and cleanups to the MM selftests. Also move more MM selftests under the kselftest framework, making them more amenable to ongoing CI testing - "selftests/mm: fix failures and robustness improvements" and "selftests/mm: assorted fixes for hmm-tests" (Sayali Patil) Fix several issues in MM selftests which were revealed by powerpc 64k pagesize * tag 'mm-stable-2026-06-23-08-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (118 commits) Revert "mm: limit filemap_fault readahead to VMA boundaries" mm/vmscan: pass NULL to trace vmscan node reclaim mm: use mapping_mapped to simplify the code selftests/mm: fix exclusive_cow test fork() handling selftests/mm: remove hardcoded THP sizing assumptions in hmm tests selftests/mm: allow PUD-level entries in compound testcase of hmm tests mm/gup_test: reject wrapped user ranges mm/page_frag: reject invalid CPUs in page_frag_test mm/damon/core: always put unsuccessfully committed target pids mm: page_isolation: avoid unsafe folio reads while scanning compound pages mm/shrinker: do not hold RCU lock in shrinker_debugfs_count_show() selftests: mm: fix and speedup "droppable" test mm: merge writeout into pageout MAINTAINERS: add Hao Ge as reviewer for codetag and alloc_tag selftests/mm: clarify alternate unmapping in compaction_test selftests/mm: move hwpoison setup into run_test() and silence modprobe output for memory-failure category selftests/mm: skip uffd-stress test when nr_pages_per_cpu is zero selftests/mm: skip uffd-wp-mremap if UFFD write-protect is unsupported selftests/mm: ensure destination is hugetlb-backed in hugetlb-mremap selftest/mm: register existing mapping with userfaultfd in hugetlb-mremap ...
6 daysMerge tag 'slab-for-7.2-part2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull more slab updates from Vlastimil Babka: - Introduce and wire up a new alloc_flags parameter for modifying slab-specific behavior without adding or reusing gfp flags. Also introduce slab_alloc_context to keep function parameter bloat in check. Both are similar to what the page allocator does. kmalloc_flags() exposes alloc_flags for mm-internal users. - SLAB_ALLOC_NOLOCK flag is used to implement kmalloc_nolock() behavior without relying on lack of __GFP_RECLAIM, which caused false positives with workarounds like fd3634312a04 ("debugobject: Make it work with deferred page initialization - again"). - SLAB_ALLOC_NO_RECURSE replaces __GFP_NO_OBJ_EXT, which could have been removed, but pending memory allocation profiling changes in mm tree have grown a new user - there is however a work ongoing to replace that too, so __GFP_NO_OBJ_EXT should eventually be removed. (Vlastimil Babka) - Add kmem_buckets_alloc_track_caller() with a user to be added in the net tree (Pedro Falcato) - Fixes for kernel-doc and slabinfo (Randy Dunlap, Yichong Chen) * tag 'slab-for-7.2-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: tools/mm/slabinfo: fix total_objects attribute name slab: recognize @GFP parameter as optional in kernel-doc mm/slab: add a node-track-caller variant for kmem buckets allocation mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts() mm/slab: introduce kmalloc_flags() mm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock() mm/slab: pass slab_alloc_context to __do_kmalloc_node() mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags mm/slab: replace slab_alloc_node() parameters with slab_alloc_context mm/slab: pass alloc_flags through slab_post_alloc_hook() chain mm/slab: pass alloc_flags to new slab allocation mm/slab: add alloc_flags to slab_alloc_context mm/slab: replace struct partial_context with slab_alloc_context mm/slab: introduce alloc_flags and SLAB_ALLOC_NOLOCK mm/slab: introduce slab_alloc_context mm/slab: stop inlining __slab_alloc_node() mm/slab: do not init any kfence objects on allocation
7 daysMerge tag 'mm-nonmm-stable-2026-06-21-10-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "taskstats: fix TGID dead-thread stat retention" (Yiyang Chen) Fix a taskstats TGID aggregation bug where fields added in the TGID query path were not preserved after thread exit, and adds a kselftest covering the regression. - "lib/tests: string_helpers: Slight improvements" (Andy Shevchenko) Improve lib/tests/string_helpers_kunit.c a little - "lib/base64: decode fixes" (Josh Law) Address minor issues in lib/base64.c - "selftests/filelock: Make output more kselftestish" (Mark Brown) Make the output from the ofdlocks test a bit easier for tooling to work with. Also ignore the generated file - "uaccess: unify inline vs outline copy_{from,to}_user() selection" (Yury Norov) Simplify the usercopy code by removing the selectability of inlining copy_{from,to}_user(). - "ocfs2: validate inline xattr header consumers" (ZhengYuan Huang) Fix a number of possible issues in the ocfs2 xattr code - "lib and lib/cmdline enhancements" (Dmitry Antipov) Provide additional robustness checking in the cmdline handling code and its in-kernel testing and selftests - "cleanup the RAID6 P/Q library" (Christoph Hellwig) Clean up the RAID6 P/Q library to match the recent updates to the RAID 5 XOR library and other CRC/crypto libraries - "ocfs2: harden inode validators against forged metadata" (Michael Bommarito) Add three structural checks to OCFS2 dinode validation so malformed on-disk fields are rejected before ocfs2_populate_inode() copies them into the in-core inode - "lib/raid: replace __get_free_pages() call with kmalloc()" (Mike Rapoport) Clean up the lib/raid code by using kmalloc() in more places * tag 'mm-nonmm-stable-2026-06-21-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (108 commits) ocfs2: fix circular locking dependency in ocfs2_dio_end_io_write ocfs2: fix NULL h_transaction deref in ocfs2_assure_trans_credits lib: interval_tree_test: validate benchmark parameters ocfs2: avoid moving extents to occupied clusters treewide: fix transposed "sign" typos and update spelling.txt ocfs2: fix UBSAN array-index-out-of-bounds in ocfs2_sum_rightmost_rec fat: reject BPB volumes whose data area starts beyond total sectors selftests/uevent: increase __UEVENT_BUFFER_SIZE to avoid ENOBUFS on busy systems lib/test_firmware: allocate the configured into_buf size fs: efs: remove unneeded debug prints checkpatch: cuppress warnings when Reported-by: is followed by Link: MAINTAINERS: add Alexander as a kcov reviewer mailmap: update Alexander Sverdlin's Email addresses fs: fat: inode: replace sprintf() with scnprintf() ocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent ocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release() ocfs2/dlm: require a ref for locking_state debugfs open ocfs2: reject FITRIM ranges shorter than a cluster ocfs2: validate fast symlink target during inode read ocfs2: add journal NULL check in ocfs2_checkpoint_inode() ...
7 daysRevert "mm: limit filemap_fault readahead to VMA boundaries"Lorenzo Stoakes
This reverts commit 7b32f64bc512b40b268776c5ac4d354b325b3197. This patch caused a significant performance regression, so revert it, and we can determine whether the approach is sensible or not moving forwards, and if so how to avoid this. There was a merge conflict with commit de97ae6222c1 ("mm/readahead: no PG_readahead on EOF"), care was taken to ensure that the revert retained the behaviour of this patch and cleanly reverts commit 7b32f64bc512 ("mm: limit filemap_fault readahead to VMA boundaries") only. Link: https://lore.kernel.org/20260619112852.104213-1-ljs@kernel.org Fixes: 7b32f64bc512 ("mm: limit filemap_fault readahead to VMA boundaries") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202606181547.617a6967-lkp@intel.com Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/vmscan: pass NULL to trace vmscan node reclaimBen Dooks
The tracepoint for node relcaims takes a `struct mem_cgroup *` as the third argument, so pass NULL instead of 0 to fix warning about using an integer as a pointer. Fixes the following warnings: mm/vmscan.c:6753:66: warning: Using plain integer as NULL pointer mm/vmscan.c:6757:58: warning: Using plain integer as NULL pointer mm/vmscan.c:7818:60: warning: Using plain integer as NULL pointer Link: https://lore.kernel.org/20260616095906.210016-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm: use mapping_mapped to simplify the codeHuang Shijie
Use mapping_mapped() to simplify the code, make the code tidy and clean. Link: https://lore.kernel.org/20260612073032.33228-1-huangsj@hygon.cn Signed-off-by: Huang Shijie <huangsj@hygon.cn> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/gup_test: reject wrapped user rangesSamuel Moelius
gup_test accepts an address and size from the debugfs ioctl and repeatedly compares against addr + size. If that addition wraps, the loop can be skipped and the ioctl returns success with size rewritten to zero. Compute the end address once with overflow checking and use that checked end for the loop bounds. Assisted-by: Codex:gpt-5.5-cyber-preview Link: https://lore.kernel.org/20260609004814.1240586.6294d614ac80.gup-test-range-end-wrap@trailofbits.com Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/damon/core: always put unsuccessfully committed target pidsSeongJae Park
damon_commit_target() puts and gets the destination and the source target pids. It puts the destination target pid because it will be overwritten by the source target pid. It gets the source pid because the caller is supposed to eventually put the pids. In more detail, the caller will call damon_destroy_ctx() after damon_commit_ctx() to destroy the entire source context. And in this case, [f]vaddr operation set's cleanup_target() callback will put the pids. The commit operation is made at the context level. The operation can fail in multiple places including in the middle and after the targets commit operations. For any such failures, immediately the error is returned to the damon_commit_ctx() caller. If some or all of the source target pids were committed to the destination during the unsuccessful context commit attempt, those pids should be put twice. The source context will do the put operations using the above explained routine. However, let's suppose the destination context was not originally using [f]vaddr operation set and the commit failed before the ops of the source context is committed. The destination does not have the cleanup_target() ops callback, so it cannot put the pids via the damon_destroy_ctx(). As a result, the pids are leaked. The issue in the real world would be not very common. The commit feature is for changing parameters of running DAMON context while inheriting internal status like the monitoring results. The monitoring results of a physical address range ain't have things that are beneficial to be inherited to a virtual address ranges monitoring. So the problem-causing DAMON control would be not very common in the real world. That said, it is a supported feature. And damon_commit_target() failure due to memory allocation is relatively realistic [1] if there are a huge number of target regions. Fix by putting the pids in the commit operation in case of the failures. The issue was discovered [2] by Sashiko. Link: https://lore.kernel.org/20260605013849.83750-1-sj@kernel.org Link: https://lore.kernel.org/20260603112306.58490-1-akinobu.mita@gmail.com [1] Link: https://lore.kernel.org/20260320020056.835-1-sj@kernel.org [2] Fixes: 83dc7bbaecae ("mm/damon/sysfs: use damon_commit_ctx()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.11.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm: page_isolation: avoid unsafe folio reads while scanning compound pagesKaitao Cheng
page_is_unmovable() can inspect compound pages without holding a folio reference or any lock. The folio can therefore be freed, split or reused while the scanner is still looking at it. The existing HugeTLB handling already avoids folio_hstate() for this reason, but it still derives the hstate from folio_size() and later derives the scan step from folio_nr_pages() and folio_page_idx(). These helpers rely on the folio still being a valid folio head. If the folio changed concurrently, the scanner can read inconsistent folio metadata and compute a wrong step. In the worst case, folio_nr_pages() can return 1 for what used to be a tail page and the subtraction from folio_page_idx() can underflow. There is a similar issue for non-Hugetlb compound pages: folio_test_lru() expects a valid folio. If the previously observed head page has been reused as a tail page of another compound page, the folio flag checks can trigger VM_BUG_ON_PGFLAGS(). Read the compound order once with compound_order(), reject obviously bogus orders, and derive the hstate and scan step from that order instead of querying folio size information again. Also use PageLRU(page), which is safe for the page being scanned, instead of folio_test_lru() on a potentially stale folio pointer. Treat an unknown HugeTLB hstate as unmovable so the scanner does not try to skip over an unstable HugeTLB folio. Link: https://lore.kernel.org/20260602130755.38794-1-kaitao.cheng@linux.dev Fixes: a0a9f2180b90 ("mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock") Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/shrinker: do not hold RCU lock in shrinker_debugfs_count_show()Shakeel Butt
Reading the debugfs "count" file of a memcg-aware shrinker can sleep inside an RCU read-side critical section: BUG: sleeping function called from invalid context at kernel/cgroup/rstat.c:421 RCU nest depth: 1, expected: 0 css_rstat_flush mem_cgroup_flush_stats zswap_shrinker_count shrinker_debugfs_count_show shrinker_debugfs_count_show() invokes the ->count_objects() callback under rcu_read_lock(). The zswap callback flushes memcg stats via css_rstat_flush(), which may sleep, so it must not run under RCU. The RCU lock is not needed here. mem_cgroup_iter() takes RCU internally and returns a memcg holding a css reference (dropped on the next iteration or by mem_cgroup_iter_break()), so the memcg stays alive without it. The shrinker is kept alive by the open debugfs file: shrinker_free() removes the debugfs entries via debugfs_remove_recursive(), which waits for in-flight readers to drain, before call_rcu(..., shrinker_free_rcu_cb). The sibling "scan" handler already invokes the sleeping ->scan_objects() callback with no RCU section. Drop the rcu_read_lock()/rcu_read_unlock(). Link: https://lore.kernel.org/20260610232048.62930-1-shakeel.butt@linux.dev Fixes: 5035ebc644ae ("mm: shrinkers: introduce debugfs interface for memory shrinkers") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: Zenghui Yu <zenghui.yu@linux.dev> Closes: https://lore.kernel.org/all/c052a064-cddb-494f-a0d8-f8a10b4b1c4d@linux.dev/ Suggested-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Qi Zheng <qi.zheng@linux.dev> Tested-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm: merge writeout into pageoutChristoph Hellwig
writeout is only called from pageout, and a straight flow at the end, so merge the two functions. Link: https://lore.kernel.org/20260601113449.3464734-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Baoquan He <baoquan.he@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: avoid underflow in madvise_collapse for sub-PMD MADV_COLLAPSEChen Wandun
madvise_collapse() computes the THP-aligned window: hstart = ALIGN(start, HPAGE_PMD_SIZE); /* round up */ hend = ALIGN_DOWN(end, HPAGE_PMD_SIZE); /* round down */ The following case will cause hstart > hend, and result in underflow in the return statement, avoid it by returning zero early when hstart > hend. The return value is due to input is valid to madvise(), and there is nothing to collapse. madvise(PMD-aligned + PAGE_SIZE, PAGE_SIZE, MADV_COLLAPSE); In addition, kmalloc_obj(), mmgrab() and lru_add_drain_all() are unnecessary when hstart == hend, so skip these operations by returning early too. Link: https://lore.kernel.org/20260513055428.1664898-1-chenwandun@lixiang.com Signed-off-by: Chen Wandun <chenwandun@lixiang.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: enable clean pagecache folio collapse for writable filesZi Yan
collapse_file() is capable of collapsing pagecache folios from writable files to PMD folios. Now enable clean pagecache folio collapse in addition to read-only pagecache folio collapse by removing the inode_is_open_for_write() from file_thp_enabled() and only performing filemap_flush() if the file is read-only. This means userspace needs to explicitly flush the content of pagecache folios before khugepaged can collapse the folios, or use madvise(MADV_COLLAPSE), which does the flush in the retry. The reason is that blindly enabling dirty pagecache folio from writable files collapse makes khugepaged flush these folios all the time. It is undesirable to cause system level pagecache flushes. To properly support dirty pagecache folio collapse, filemap_flush() needs to be avoided. Potentially, merging associated buffer instead of dropping it with filemap_release_folio() might be needed. NOTE: this breaks khugepaged selftests for writable file pagecache collapse, which is set to fail all the time. The next commit fixes it. Link: https://lore.kernel.org/20260517135416.1434539-14-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/truncate: use folio_split() in truncate_inode_partial_folio()Zi Yan
After READ_ONLY_THP_FOR_FS is removed, FS either supports large folio or not. folio_split() can be used on a FS with large folio support without worrying about getting a THP on a FS without large folio support. When READ_ONLY_THP_FOR_FS was present, a PMD large pagecache folio can appear in a FS without large folio support after khugepaged or madvise(MADV_COLLAPSE) creates it. During truncate_inode_partial_folio(), such a PMD large pagecache folio is split and if the FS does not support large folio, it needs to be split to order-0 ones and could not be split non uniformly to ones with various orders. try_folio_split_to_order() was added to handle this situation by checking folio_check_splittable(..., SPLIT_TYPE_NON_UNIFORM) to detect if the large folio is created due to READ_ONLY_THP_FOR_FS and the FS does not support large folio. Now READ_ONLY_THP_FOR_FS is removed, all large pagecache folios are created with FSes supporting large folio, this function is no longer needed and all large pagecache folios can be split non uniformly. Link: https://lore.kernel.org/20260517135416.1434539-10-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FSZi Yan
Without READ_ONLY_THP_FOR_FS, large file-backed folios cannot be created by a FS without large folio support. The check is no longer needed. Link: https://lore.kernel.org/20260517135416.1434539-9-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm: fs: remove filemap_nr_thps*() functions and their usersZi Yan
They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without large folio support, so that read-only THPs created in these FSes are not seen by the FSes when the underlying fd becomes writable. Now read-only PMD THPs only appear in a FS with large folio support and the supported orders include PMD_ORDER. READ_ONLY_THP_FOR_FS was using mapping->nr_thps, inode->i_writecount, and smp_mb() to prevent writes to a read-only THP and collapsing writable folios into a THP. In collapse_file(), mapping->nr_thps is increased, then smp_mb(), and if inode->i_writecount > 0, collapse is stopped, while do_dentry_open() first increases inode->i_writecount, then a full memory fence, and if mapping->nr_thps > 0, all read-only THPs are truncated. Now this mechanism can be removed along with READ_ONLY_THP_FOR_FS code, since a dirty folio check has been added after try_to_unmap() in collapse_file() to prevent dirty folios from being collapsed as clean. Link: https://lore.kernel.org/20260517135416.1434539-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm: remove READ_ONLY_THP_FOR_FS Kconfig optionZi Yan
After removing READ_ONLY_THP_FOR_FS check in file_thp_enabled(), khugepaged and MADV_COLLAPSE can run on FSes with PMD THP pagecache support even without READ_ONLY_THP_FOR_FS enabled. Remove the Kconfig first so that no one can use READ_ONLY_THP_FOR_FS as upcoming commits remove mapping->nr_thps, which its safe guard mechanism relies on. Link: https://lore.kernel.org/20260517135416.1434539-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Nico Pache <npache@redhat.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled()Zi Yan
Remove the READ_ONLY_THP_FOR_FS gate and khugepaged for file-backed pmd-sized hugepages are enabled by the global transparent hugepage control. khugepaged can still be enabled by per-size control for anon and shmem when the global control is off. Add shmem_hpage_pmd_enabled() stub for !CONFIG_SHMEM to remove IS_ENABLED(SHMEM) in hugepage_enabled(). Clean up hugepage_enabled() by moving anon code to anon_hpage_enabled(). Link: https://lore.kernel.org/20260517135416.1434539-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()Zi Yan
Replace it with a check on the max folio order of the file's address space mapping, making sure PMD folio is supported. Keep the inode open-for-write check, since even if collapse_file() now makes sure all to-be-collapsed folios are clean and the created PMD file THP can be handled by FSes properly, the filemap_flush() could perform undesirable write back. Link: https://lore.kernel.org/20260517135416.1434539-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: add folio dirty check after try_to_unmap()Zi Yan
This check ensures the correctness of read-only PMD folio collapse after it is enabled for all FSes supporting PMD pagecache folios and replaces READ_ONLY_THP_FOR_FS. READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps and inode->i_writecount to prevent any write to read-only to-be-collapsed folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the aforementioned mechanism will go away too. To ensure khugepaged functions as expected after the changes, skip if any folio is dirty after try_to_unmap(), since a dirty folio at that point means this read-only folio can get writes between try_to_unmap() and try_to_unmap_flush() via cached TLB entries and khugepaged does not support writable pagecache folio collapse yet. Link: https://lore.kernel.org/20260517135416.1434539-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: remove READ_ONLY_THP_FOR_FS checkZi Yan
Patch series "Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files", v6. This patch (of 14): collapse_file() requires FSes supporting large folio with at least PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. MADV_COLLAPSE ignores shmem huge config, so exclude the check for shmem. While at it, replace VM_BUG_ON with VM_WARN_ON_ONCE. Add a helper function mapping_pmd_folio_support() for FSes supporting large folio with at least PMD_ORDER. Link: https://lore.kernel.org/20260517135416.1434539-1-ziy@nvidia.com Link: https://lore.kernel.org/20260517135416.1434539-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: fix PMD collapse swap PTE accountingLance Yang
mthp_collapse() uses mthp_present_ptes to decide whether a range has enough occupied PTEs to try collapse. Swap PTEs accepted by collapse_scan_pmd() are counted in unmapped, but are not represented in mthp_present_ptes. When lower orders are enabled, collapse_scan_pmd() relaxes max_ptes_none so the scan can cover the whole PMD and build the bitmap. mthp_collapse() then checks the PMD-order candidate using the bitmap. With max_ptes_none set to 0, a range with 511 present PTEs and one swap PTE no longer reaches collapse_huge_page(), even though PMD collapse can handle swap PTEs up to max_ptes_swap. Account unmapped PTEs only for PMD order. PMD collapse supports swap PTEs through max_ptes_swap, while lower-order mTHP collapse does not currently support non-present PTEs. Keep non-present PTEs out of the lower-order eligibility check. Link: https://lore.kernel.org/20260609120443.71864-1-lance.yang@linux.dev Fixes: 90ed32d00054 ("mm/khugepaged: introduce mTHP collapse support") Signed-off-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Nico Pache <npache@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: run khugepaged for all ordersBaolin Wang
If any order (m)THP is enabled we should allow running khugepaged to attempt scanning and collapsing mTHPs. In order for khugepaged to operate when only mTHP sizes are specified in sysfs, we must modify the predicate function that determines whether it ought to run to do so. This function is currently called hugepage_pmd_enabled(), this patch renames it to hugepage_enabled() and updates the logic to check to determine whether any valid orders may exist which would justify khugepaged running. We must also update collapse_possible_orders() to check all orders if the vma is anonymous and the collapse is khugepaged. After this patch khugepaged mTHP collapse is fully enabled. Link: https://lore.kernel.org/20260605161422.213817-14-npache@redhat.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: avoid unnecessary mTHP collapse attemptsNico Pache
There are cases where, if an attempted collapse fails, all subsequent orders are guaranteed to also fail. Avoid these collapse attempts by bailing out early. Link: https://lore.kernel.org/20260605161422.213817-13-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: introduce mTHP collapse supportNico Pache
Enable khugepaged to collapse to mTHP orders. This patch implements the main scanning logic using a bitmap to track occupied pages and the algorithm to find optimal collapse sizes. Previous to this patch, PMD collapse had 3 main phases, a light weight scanning phase (mmap_read_lock) that determines a potential PMD collapse, an alloc phase (mmap unlocked), then finally heavier collapse phase (mmap_write_lock). To enabled mTHP collapse we make the following changes: During PMD scan phase, track occupied pages in a bitmap. When mTHP orders are enabled, we remove the restriction of max_ptes_none during the scan phase to avoid missing potential mTHP collapse candidates. Once we have scanned the full PMD range and updated the bitmap to track occupied pages, we use the bitmap to find the optimal mTHP size. Implement mthp_collapse() to walk forward through the bitmap and determine the best eligible order for each naturally-aligned region. The algorithm starts at the beginning of the PMD range and, for each offset, tries the highest order that fits the alignment. If the number of occupied PTEs in that region satisfies the max_ptes_none threshold for that order, a collapse is attempted. On failure, the order is decremented and the same offset is retried at the next smaller size. Once the smallest enabled order is exhausted (or a collapse succeeds), the offset advances past the region just processed, and the next attempt starts at the highest order permitted by the new offset's natural alignment. The algorithm works as follows: 1) set offset=0 and order=HPAGE_PMD_ORDER 2) if the order is not enabled, go to step (5) 3) count occupied PTEs in the (offset, order) range using bitmap_weight_from() 4) if the count satisfies the max_ptes_none threshold, attempt collapse; on success, advance to step (6) 5) if a smaller enabled order exists, decrement order and retry from step (2) at the same offset 6) advance offset past the current region and compute the next order from the new offset's natural alignment via __ffs(offset), capped at HPAGE_PMD_ORDER 7) repeat from step (2) until the full PMD range is covered mTHP collapses reject regions containing swapped out or shared pages. This is because adding new entries can lead to new none pages, and these may lead to constant promotion into a higher order mTHP. A similar issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse introducing at least 2x the number of pages, and on a future scan will satisfy the promotion condition once again. This issue is prevented via the collapse_max_ptes_none() function which imposes the max_ptes_none restrictions above. We currently only support mTHP collapse for max_ptes_none values of 0 and HPAGE_PMD_NR - 1. resulting in the following behavior: - max_ptes_none=0: Never introduce new empty pages during collapse - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest available mTHP order Any other max_ptes_none value will emit a warning and default mTHP collapse to max_ptes_none=0. There should be no behavior change for PMD collapse. Once we determine what mTHP sizes fits best in that PMD range a collapse is attempted. A minimum collapse order of 2 is used as this is the lowest order supported by anon memory as defined by THP_ORDERS_ALL_ANON. Currently madv_collapse is not supported and will only attempt PMD collapse. We can also remove the check for is_khugepaged inside the PMD scan as the collapse_max_ptes_none() function handles this logic now. Link: https://lore.kernel.org/20260605161422.213817-12-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: introduce collapse_possible_orders helper functionsNico Pache
Add collapse_possible_orders() to generalize THP order eligibility. The function determines which THP orders are permitted based on collapse context (khugepaged vs madv_collapse). We also add collapse_possible() as a thin wrapper around collapse_possible_orders() that returns a bool rather than the whole bitmap. This consolidates collapse configuration logic and provides a clean interface for future mTHP collapse support where the orders may be different. Link: https://lore.kernel.org/20260605161422.213817-11-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: improve tracepoints for mTHP ordersNico Pache
Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to give better insight into what order is being operated at for. Link: https://lore.kernel.org/20260605161422.213817-10-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: add per-order mTHP collapse failure statisticsNico Pache
Add three new mTHP statistics to track collapse failures for different orders when encountering swap PTEs, excessive none PTEs, and shared PTEs: - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to encountering a swap PTE. - collapse_exceed_none_pte: Counts when mTHP collapse fails due to exceeding the none PTE threshold for the given order - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to encountering a shared PTE. These statistics complement the existing THP_SCAN_EXCEED_* events by providing per-order granularity for mTHP collapse attempts. The stats are exposed via sysfs under `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each supported hugepage size. As we currently do not support collapsing mTHPs that contain a swap or shared entry, those statistics keep track of how often we are encountering failed mTHP collapses due to these restrictions. We will add support for mTHP collapse for anonymous pages next; lets also track when this happens at the PMD level within the per-mTHP stats. Link: https://lore.kernel.org/20260605161422.213817-9-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: skip collapsing mTHP to smaller ordersNico Pache
khugepaged may try to collapse a mTHP to a folio of equal or smaller size, possibly resulting in a partially mapped source folio, which is undesired. Skip these cases until we have a way to check if its ok to collapse to a smaller mTHP size (like in the case of a partially mapped folio). This check is not done during the scan phase as the current collapse order is unknown at that time. This patch is inspired by Dev Jain's work on khugepaged mTHP support [1]. Link: https://lore.kernel.org/20260605161422.213817-8-npache@redhat.com Link: https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/ [1] Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: generalize collapse_huge_page for mTHP collapseNico Pache
Pass an order to collapse_huge_page to support collapsing anon memory to arbitrary orders within a PMD. order indicates what mTHP size we are attempting to collapse to. For non-PMD collapse we must leave the anon VMA write locked until after we collapse the mTHP-- in the PMD case all the pages are isolated, but in the mTHP case this is not true, and we must keep the lock to prevent access/changes to the page tables. This can happen if the rmap walkers hit a pmd_none while the PMD entry is currently unavailable due to being temporarily removed during the collapse phase. To properly establish the page table hierarchy without violating any expectations from certain architectures (e.g. MIPS), we must make sure to have the PMD reinstalled before the PTEs, and hold both PTE/PMD locks before calling update_mmu_cache_range() (if they are distinct locks). Link: https://lore.kernel.org/20260605161422.213817-7-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: require collapse_huge_page to enter/exit with the lock droppedNico Pache
Currently the collapse_huge_page function requires the mmap_read_lock to enter with it held, and exit with it dropped. This function moves the unlock into its parent caller, and changes this semantic to requiring it to enter/exit with it always unlocked. In future patches, we need this expectation, as for in mTHP collapse, we may have already dropped the lock, and do not want to conditionally check for this by passing through the lock_dropped variable. No functional change is expected as one of the first things the collapse_huge_page function does is drop this lock before allocating the hugepage. Link: https://lore.kernel.org/20260605161422.213817-6-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: generalize __collapse_huge_page_* for mTHP supportNico Pache
generalize the order of the __collapse_huge_page_* and collapse_max_* functions to support future mTHP collapse. The current mechanism for determining collapse with the khugepaged_max_ptes_none value is not designed with mTHP in mind. This raises a key design issue: if we support user defined max_pte_none values (even those scaled by order), a collapse of a lower order can introduces an feedback loop, or "creep", when max_ptes_none is set to a value greater than HPAGE_PMD_NR / 2. [1] With this configuration, a successful collapse to order N will populate enough pages to satisfy the collapse condition on order N+1 on the next scan. This leads to unnecessary work and memory churn. To fix this issue introduce a helper function that will limit mTHP collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1. This effectively supports two modes: [2] - max_ptes_none=0: never collapses if it encounters an empty PTE or a PTE that maps the shared zeropage. Consequently, no memory bloat. - max_ptes_none=511 (on 4k pagesz): Always collapse to the highest available mTHP order. This removes the possibility of "creep", and a warning will be emitted if any non-supported max_ptes_none value is configured with mTHP enabled. Any intermediate value will default mTHP collapse to max_ptes_none=0. mTHP collapse will not honor the khugepaged_max_ptes_shared or khugepaged_max_ptes_swap parameters, and will fail if it encounters a shared or swapped entry. No functional changes in this patch; however it defines future behavior for mTHP collapse. Link: https://lore.kernel.org/20260605161422.213817-5-npache@redhat.com Link: https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com [1] Link: https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com [2] Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: rework max_ptes_* handling with helper functionsNico Pache
The following cleanup reworks all the max_ptes_* handling into helper functions. This increases the code readability and will later be used to implement the mTHP handling of these variables. With these changes we abstract all the madvise_collapse() special casing (do not respect the sysctls) away from the functions that utilize them. And will be used later in this series to cleanly restrict the mTHP collapse behavior. No functional change is intended; however, we are now only reading the sysfs variables once per scan, whereas before these variables were being read on each loop iteration. Link: https://lore.kernel.org/20260605161422.213817-4-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: generalize alloc_charge_folio()Dev Jain
Pass order to alloc_charge_folio() and update mTHP statistics. Link: https://lore.kernel.org/20260605161422.213817-3-npache@redhat.com Signed-off-by: Dev Jain <dev.jain@arm.com> Co-developed-by: Nico Pache <npache@redhat.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/khugepaged: generalize hugepage_vma_revalidate for mTHP supportNico Pache
Patch series "khugepaged: add mTHP collapse support", v19. The following series provides khugepaged with the capability to collapse anonymous memory regions to mTHPs. To achieve this we generalize the khugepaged functions to no longer depend on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual pages that are occupied (!none/zero). After the PMD scan is done, we use the bitmap to find the optimal mTHP sizes for the PMD range. The restriction on max_ptes_none is removed during the scan, to make sure we account for the whole PMD range in the bitmap. When no mTHP size is enabled, the legacy behavior of khugepaged is maintained. We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1 (ie 511). If any other value is specified, the kernel will emit a warning and mTHP collapse will default to max_ptes_none=0. If a mTHP collapse is attempted, but contains swapped out, or shared pages, we don't perform the collapse. It is now also possible to collapse to mTHPs without requiring the PMD THP size to be enabled. These limitations are to prevent collapse "creep" behavior. This prevents constantly promoting mTHPs to the next available size, which would occur because a collapse introduces more non-zero pages that would satisfy the promotion condition on subsequent scans. Patch 1-2: Generalize hugepage_vma_revalidate and alloc_charge_folio for arbitrary orders. Patch 3: Rework max_ptes_* handling into helper functions Patch 4: Generalize __collapse_huge_page_* for mTHP support Patch 5: Require collapse_huge_page to enter/exit with the lock dropped Patch 6: Generalize collapse_huge_page for mTHP collapse Patch 7: Skip collapsing mTHP to smaller orders Patch 8-9: Add per-order mTHP statistics and tracepoints Patch 10: Introduce collapse_possible_orders helper functions Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled Patch 14: Documentation Testing: - Built for x86_64, aarch64, ppc64le, and s390x - ran all arches on test suites provided by the kernel-tests project - internal testing suites: functional testing and performance testing - selftests mm - I created a test script that I used to push khugepaged to its limits while monitoring a number of stats and tracepoints. The code is available here[1] (Run in legacy mode for these changes and set mthp sizes to inherit) The summary from my testings was that there was no significant regression noticed through this test. In some cases my changes had better collapse latencies, and was able to scan more pages in the same amount of time/work, but for the most part the results were consistent. - redis testing. I did some testing with these changes along with my defer changes (see followup [2] post for more details). We've decided to get the mTHP changes merged first before attempting the defer series. - some basic testing on 64k page size. - lots of general use. This patch (of 14): For khugepaged to support different mTHP orders, we must generalize this to check if the PMD is not shared by another VMA and that the order is enabled. We cannot collapse VMA regions that do not span the full PMD. This is due to the potential of the PMD being shared by another VMA which leaves us vulnerable to race conditions if neighboring VMAs are resized. Always check the PMD order here to ensure its not shared by another VMA. We'd need to lock all VMAs in the PMD range to support this which may lead to increased lock contention and code complexity. No functional change in this patch. Also correct a comment about the functionality of the revalidation and fix a double space issues. Link: https://lore.kernel.org/20260605161422.213817-1-npache@redhat.com Link: https://lore.kernel.org/20260605161422.213817-2-npache@redhat.com Link: https://gitlab.com/npache/khugepaged_mthp_test [1] Link: https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/ [2] Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/page_alloc: only update NUMA min ratios on sysctl writeJianlin Shi
The sysctl handlers for min_unmapped_ratio and min_slab_ratio invoke setup_min_unmapped_ratio() and setup_min_slab_ratio() unconditionally after proc_dointvec_minmax(), even for read operations. These setup functions first zero all per-NUMA node thresholds (min_unmapped_pages and min_slab_pages) before recalculating them. Reading /proc sysctl entries therefore temporarily resets node reclaim thresholds to zero, which may disturb the behavior of __node_reclaim() and node_reclaim() during the recomputation. Fix this by only calling the setup functions when the sysctl is actually written (write == 1), matching the behavior of existing sysctl handlers like min_free_kbytes and watermark_scale_factor. This only affects systems with CONFIG_NUMA. Link: https://lore.kernel.org/tencent_5891052AF9A4C2D490A62F478D446F74AB09@qq.com Signed-off-by: Jianlin Shi <shijianlin11@foxmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 dayszsmalloc: simplify data output in zs_stats_size_show()Markus Elfring
Move the specification for a line break from a seq_puts() call to a seq_printf() call. The source code was transformed by using the Coccinelle software. Link: https://lore.kernel.org/126a924b-6f68-43bf-ae5a-449fb93e527b@web.de Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysmm/alloc_tag: replace fixed-size early PFN array with dynamic linked listHao Ge
Pages allocated before page_ext is available have their codetag left uninitialized. Track these early PFNs and clear their codetag in clear_early_alloc_pfn_tag_refs() to avoid "alloc_tag was not set" warnings when they are freed later. Currently a fixed-size array of 8192 entries is used, with a warning if the limit is exceeded. However, the number of early allocations depends on the number of CPUs and can be larger than 8192. Replace the fixed-size array with a dynamically allocated linked list of pfn_pool structs. Each node is allocated via alloc_page() and mapped to a pfn_pool containing a next pointer, an atomic slot counter, and a PFN array that fills the remainder of the page. The tracking pages themselves are allocated via alloc_page(), which would trigger __pgalloc_tag_add() -> alloc_tag_add_early_pfn() and recurse indefinitely. Introduce __GFP_NO_CODETAG (reuses the %__GFP_NO_OBJ_EXT bit) and pass gfp_flags through pgalloc_tag_add() so that the early path can skip recording allocations that carry this flag. Link: https://lore.kernel.org/20260604024008.46592-1-hao.ge@linux.dev Signed-off-by: Hao Ge <hao.ge@linux.dev> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
7 daysMerge tag 'liveupdate-v7.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux Pull liveupdate updates from Mike Rapoport: "Kexec Handover (KHO): - make memory preservation compatible with deferred initialization of the memory map Live Update Orchestrator (LUO): - add LIVEUPDATE_SESSION_GET_NAME ioctl and parameter verification for LIVEUPDATE_IOCTL_CREATE_SESSION ioctl - documentation updates for liveupdate=on command line option, systemd support and the current compatibility status - remove the fixed limits on the number of files that can be preserved within a single session, and the total number of sessions managed by the LUO Misc fixes: - reference count incoming File-Lifecycle-Bound (FLB) data so it cannot be freed while a subsystem is still using it - fixes for a TOCTOU race in luo_session_retrieve(), a use- after-free in the file finish and unpreserve paths, concurrent session mutations during reboot and serialization on preserve_context kexec - make sure ioctls for incoming LUO sessions are blocked for outgoing sessions and vice versa - make sure KHO scratch size is always aligned by CMA_MIN_ALIGNMENT_BYTES - fix memblock tests build issue introduced by KHO changes" * tag 'liveupdate-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux: (36 commits) liveupdate: Document that retrieve failure is permanent docs: memfd_preservation: fix rendering of ABI documentation selftests/liveupdate: Add stress-files kexec test selftests/liveupdate: Add stress-sessions kexec test selftests/liveupdate: Test session and file limit removal liveupdate: Remove limit on the number of files per session liveupdate: Remove limit on the number of sessions liveupdate: defer session block allocation and physical address setting kho: add support for linked-block serialization liveupdate: Extract luo_session_deserialize_one helper liveupdate: Extract luo_file_deserialize_one helper liveupdate: register luo_ser as KHO subtree liveupdate: centralize state management into struct luo_ser liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd liveupdate: change file_set->count type to u64 for type safety liveupdate: Remove unused ser field from struct luo_session liveupdate: fix u-a-f in luo_file_unpreserve_files() and luo_file_finish() liveupdate: block session mutations during reboot liveupdate: fix TOCTOU race in luo_session_retrieve() liveupdate: skip serialization for context-preserving kexec ...
9 daysMerge tag 'mm-stable-2026-06-18-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "selftests/mm: clean up build output and verbosity" (Li Wang) Remove some noise from the MM selftests build - "mm: Free contiguous order-0 pages efficiently" (Ryan Roberts) Speed up the freeing of a batch of 0-order pages by first scanning them for coalescing opportunities. This is applicable to vfree() and to the releasing of frozen pages - "mm/damon: introduce DAMOS failed region quota charge ratio" (SeongJae Park) Address a DAMOS usability issue: The DAMOS quota often exhausts prematurely because it charges for all memory attempted, causing slow and inconsistent performance when actions fail on unreclaimable memory. To fix this, a new feature lets users set a smaller, flexible quota charge ratio (via a numerator and denominator) for failed regions. Since failed actions cause less overhead, reducing their quota cost ensures more predictable and efficient DAMOS processing - "selftests/cgroup: improve zswap tests robustness and support large page sizes" (Li Wang) Fix various spurious failures and improves the overall robustness of the cgroup zswap selftests - "fix MAP_DROPPABLE not supported errno" (Anthony Yznaga) Fix an issue in the mlock selftests on arm32 - "mm: huge_memory: clean up defrag sysfs with shared" (Breno Leitao) Some maintenance work in the huge_memory code - "treewide: fixup gfp_t printks" (Brendan Jackman) Use the special vprintf() gfp_t conversion in various places - "mm: Fix vmemmap optimization accounting and initialization" (Muchun Song) Fix several bugs in the vmemmap optimization, mainly around incorrect page accounting and memmap initialization in the DAX and memory hotplug paths. It also fixes pageblock migratetype initialization and struct page initialization for ZONE_DEVICE compound pages - "mm/damon: repost non-hotfix reviewed patches in damon/next tree" A sprinkle of unrelated minor bugfixes for DAMON - "mm: remove page_mapped()" (David Hildenbrand) Remove this function from the tree, replacing it with folio_mapped() - "mm/damon: let DAMON be paused and resumed" (SeongJae Park) Allow DAMON to be paused and resumed without losing its current state - "kasan: hw_tags: Disable tagging for stack and page-tables" (Muhammad Usama Anjum) Simplify and speed up kasan by removing its ineffective tagging of stacks and page tables - "mm/damon/reclaim,lru_sort: monitor all system rams by default" (SeongJae Park) Simplify deployment on diverse hardware like NUMA systems by updating DAMON_RECLAIM and DAMON_LRU_SORT to automatically monitor the physical address range covering all System RAM areas by default, replacing the overly restrictive behavior that only targeted the single largest memory block to save on negligible overhead - "mm/damon/sysfs: document filters/ directory as deprecated" (SeongJae Park) Update some DAMON docs - "mm: use spinlock guards for zone lock" (Dmitry Ilvokhin) Switch zone->lock handling over to using the guard() mechanisms - "mm/filemap: tighten mmap_miss hit accounting" (fujunjie) Fix a flaw where the mmap_miss counter over-credited page cache hits during fault-arounds and page-fault retries. This results in significant reduction of redundant synchronous mmap readahead I/O, drastically cutting down execution time and gigabytes read for sparse random or strided memory access workloads - "selftests/cgroup: Fix false positive failures in test_percpu_basic" (Li Wang) Fix a couple of false-positives in the cgroup kmem selftests - "mm/damon/reclaim: support monitoring intervals auto-tuning" (SeongJae Park) Add a new parameter to DAMON permitting DAMON_RECLAIM to automatically tune DAMON's sampling and aggregation intervals - "mm/damon/stat: add kdamond_pid parameter" (SeongJae Park) Change DAMON_STAT to provide the pid of its kdamond - "mm/kmemleak: dedupe verbose scan output" (Breno Leitao) Remove large amounts of duplicated backtraces from the verbose-mode kmemleak output - "mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE (Part 1)" (David Hildenbrand) Reduce our use of CONFIG_HAVE_BOOTMEM_INFO_NODE, with a view to removing it entirely in a later series - "mm/damon: validate min_region_size to be power of 2" (Liew Rui Yan) Prevent users from passing a non-power-of-2 value of `addr_unit', as this later results in undesirable behavior - "mm: document read_pages and simplify usage" (Frederick Mayle) - "tools/mm/page-types: Fix misc bugs" (Ye Liu) Fix three issues in tools/mm/page-types.c - "mm: misc cleanups from __GFP_UNMAPPED series" (Brendan Jackman) Implement several cleanups in the page allocator and related code - "mm, swap: swap table phase IV: unify allocation" (Kairui Song) Unify the allocation and charging of anon and shmem swap in folios, provides better synchronization, consolidates the metadata management, hence dropping the static array and map, and improves performance - "mm/damon: introduce data attributes monitoring" (SeongJae Park( Extend DAMON to monitor general data attributes other than accesses - "mm/vmalloc: free unused pages on vrealloc() shrink" (Shivam Kalra) Implement the TODO in vrealloc() to unmap and free unused pages when shrinking across a page boundary - "mm/damon: documentation and comment fixes" (niecheng) - "remove mmap_action success, error hooks" (Lorenzo Stoakes) Eliminate custom hooks from mmap_action by removing the problematic success_hook which allowed drivers to improperly access uninitialized VMAs. It replaces the error_hook with a simple error-code field and updates the memory char driver accordingly - "mm/damon: minor improvements for code readability and tests" (SeongJae Park) - "mm/damon: fix macro arguments and clarify quota goals doc" (Maksym Shcherba) - "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c" (Mike Rapoport) - "mm/mglru: improve reclaim loop and dirty folio" (Kairui Song and others) Clean up and slightly improves MGLRU's reclaim loop and dirty writeback handling. Large performance improvements are measured - "use vma locks for proc/pid/{smaps|numa_maps} reads" (Suren Baghdasaryan) Use per-vma locks when reading /proc/pid/smaps and numa_maps similar to reduce contention on central mmap_lock - "refactors thpsize_shmem_enabled_store() and thpsize_shmem_enabled_show()" (Ran Xiaokai) Some cleanup work in the THP code - "selftests/memfd: fix compilation warnings" (Konstantin Khorenko) Fix a few build glitches in the memfd selftest code. - "memcg: shrink obj_stock_pcp and cache multiple objcgs" (Shakeel Butt) Resolve a 68% performance regression caused by NUMA-node cache thrashing around struct obj_stock_pcp by shrinking its existing fields and expanding it into a multi-slot array that caches up to five obj_cgroup pointers per CPU, allowing per-node variants of the same memcg to coexist within a single 64-byte cache line. - "zram: writeback fixes" (Sergey Senozhatsky) address a couple of unrelated zram writeback issues - "mm: switch THP shrinker to list_lru" (Johannes Weiner) Resolve NUMA-awareness issues and streamlines callsite interaction by refactoring and extending the list_lru API to completely replace the complex, open-coded deferred split queue for Transparent Huge Pages - "mm: improve large folio readahead for exec memory" (Usama Arif) Improve large-folio readahead on systems like 64K-page arm64 by preventing the mmap_miss check from permanently disabling target-oriented VM_EXEC readahead, and by generalizing the force_thp_readahead gate to support mappings with any usefully large maximum folio order under the cache cap. - "userfaultfd/pagemap: pre-existing fixes" (Kiryl Shutsemau) Fix a bunch of minor issues in the userfaultfd/pagemap, all of which were flagged by Sashiko review of proposed new material - "mm/sparse-vmemmap: Provide generic vmemmap_set_pmd() and vmemmap_check_pmd()" (Muchun Song) Provide generic versions of these two functions so the four arch-specific implementations can be removed. - "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device" (Youngjun Park) Address a uswsusp-vs-swapoff race and reduces the swap device reference taking/releasing frequency. - "mm/hmm: A fix and a selftest" (Dev Jain) * tag 'mm-stable-2026-06-18-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) selftests/mm/hmm-tests: test pagemap reads of PMD device-private entries fs/proc/task_mmu: do not warn on seeing non-migration pmd entry lib/test_hmm: check alloc_page_vma() return value and handle OOM mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX mm/swap: remove redundant swap device reference in alloc/free mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device mm/filemap: use folio_next_index() for start vmalloc: fix NULL pointer dereference in is_vm_area_hugepages() sparc/mm: drop vmemmap_check_pmd helper and use generic code loongarch/mm: drop vmemmap_check_pmd helper and use generic code riscv/mm: drop vmemmap_pmd helpers and use generic code arm64/mm: drop vmemmap_pmd helpers and use generic code mm/sparse-vmemmap: provide generic vmemmap_set_pmd() and vmemmap_check_pmd() rust: page: mark Page::nid as inline userfaultfd: build __VMA_UFFD_FLAGS from config-gated masks userfaultfd: gate must_wait writability check on pte_present() mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade fs/proc/task_mmu: fix hugetlb self-deadlock in pagemap_scan_pte_hole() fs/proc/task_mmu: use huge_page_size() in pagemap_scan_hugetlb_entry() fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race ...
11 daysmm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheavesVlastimil Babka (SUSE)
Finish the switch away from __GFP_NO_OBJ_EXT by replacing it with SLAB_ALLOC_NO_RECURSE when allocating empty sheaves. Pass alloc_flags to [__]alloc_empty_sheaf(). Callers that can't be part of a recursive kmalloc() chain simply pass SLAB_ALLOC_DEFAULT. Use kmalloc_flags() instead of kzalloc() for allocating the sheaf. With that we can finalize the removal the __GFP_NO_OBJ_EXT handling from obj_ext allocations as well, leaving only SLAB_ALLOC_NO_RECURSE in place. This leaves __GFP_NO_OBJ_EXT with no users in slab, so stop allowing the flag in kmalloc_nolock(). Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-16-7190909db118@kernel.org Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
11 daysmm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()Vlastimil Babka (SUSE)
__GFP_NO_OBJ_EXT has limited scope within the slab allocator itself and gfp flags are a scarce resource, unlike slab's alloc_flags. Introduce SLAB_ALLOC_NO_RECURSE alloc flag that has the same intent as __GFP_NO_OBJ_EXT but a more generic name, meaning that a kmalloc() family function should not recurse into another kmalloc*() for the purposes of allocating auxiliary structures (obj_ext arrays or sheaves). First, replace the __GFP_NO_OBJ_EXT for allocating obj_ext arrays in alloc_slab_obj_exts(). Make use of the newly added kmalloc_flags() function, where we can pass alloc_flags with SLAB_ALLOC_NO_RECURSE added. This will also pass through SLAB_ALLOC_NOLOCK so we don't need to special case kmalloc_nolock() anymore. Note that until now the kmalloc_nolock() ignored the incoming gfp flags and hardcoded __GFP_ZERO | __GFP_NO_OBJ_EXT. But it's correct to pass on the incoming gfp flags (only augmented with __GFP_ZERO), because if alloc_flags contain SLAB_ALLOC_NOLOCK, the incoming gfp flags have to be also compatible with it. However, we might have added __GFP_THISNODE for opportunistic slab allocation, as pointed out by Hao Li, and __GFP_COMP by allocate_slab() as pointed out by Shengming Hu. Solve this by adding both flags to OBJCGS_CLEAR_MASK as it makes sense to strip them anyway for non-kmalloc_nolock() allocations of sheaves or obj_ext arrays as well. To avoid recursion of sheaf -> obj_ext -> sheaf -> ... allocations at this patch, until the next patch converts sheaves to SLAB_ALLOC_NO_RECURSE, use both gfp and alloc_flags for obj_ext. The next patch will remove the gfp part. Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-15-7190909db118@kernel.org Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
11 daysMerge tag 'memblock-v7.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: "Small fixes and a cleanup: - numa emulation: fix detection of under-allocated emulated nodes - memblock tests: fix NUMA tests to properly differentiate reserved areas with differnet flags - mm_init: use div64_ul() instead of do_div() to better express the intent of the division" * tag 'memblock-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: mm: mm_init: use div64_ul() instead of do_div() tools/testing/memblock: fix stale NUMA reservation tests mm/fake-numa: fix under-allocation detection in uniform split
13 daysMerge tag 'for-7.2/block-20260615' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Per-controller admin and IO timeout sysfs attributes, and letting the block layer set request timeouts (Maurizio, Maximilian) - Multipath passthrough iostats, and PCI P2PDMA enablement for multipath devices (Keith, Kiran) - A new diag sysfs attribute group exporting per-controller counters (retries, multipath failover, error counters, requeue and failure counts, reset and reconnect events) (Nilay) - FDP configuration validation and bounds check fixes (liuxixin) - Various nvmet fixes, including a pre-auth out-of-bounds read in the Discovery Get Log Page handler, auth payload bounds validation, and tcp error-path leak fixes (Bryam, Tianchu, Geliang) - nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki, Eric) - Assorted other fixes and cleanups (John, Yao, Chao, Mateusz, Achkinazi, Wentao) - MD pull request via Yu Kuai: - raid1/raid10 fixes for a deadlock in the read error recovery path, error-path detection and bio accounting with cloned bios, and an nr_pending leak in the REQ_ATOMIC bad-block error path (Abd-Alrhman) - PCI P2PDMA propagation from member devices to the RAID device (Kiran) - dm-raid bio requeue fix, and various smaller fixes and cleanups (Benjamin, Chen, Li, Thorsten) - Enable Clang lock context analysis for the block layer, with the accompanying annotations across queue limits, the blk_holder_ops callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart) - Block status code infrastructure work: a tagged status table, a str_to_blk_op() helper, a bio_endio_status() helper, and on top of that a new configurable block-layer error injection facility (Christoph) - DRBD netlink rework, replacing the genl_magic machinery with explicit netlink serialization and moving the DRBD UAPI headers to include/uapi/linux/ (Christoph Böhmwalder) - bvec improvements: a bvec_folio() helper and making the bvec_iter helpers proper inline functions (Willy, Christoph) - ublk cleanups and a canceling-flag fix for the disk-not-allocated case (Caleb, Ming) - Partition handling fixes: bound the AIX pp_count scan, fix an of_node refcount leak, and replace __get_free_page() with kmalloc() (Bryam, Wentao, Mike) - Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add WQ_PERCPU to the block workqueue users (Mateusz, Marco) - Block statistics and tracing: propagate in-flight to the whole disk on partition IO, export passthrough stats, and a new block_rq_tag_wait tracepoint (Tang, Keith, Aaron) - A round of removals, unexports and cleanups across bio, direct-io and the bvec helpers (Christoph) - Various driver fixes (mtip32xx use-after-free, rbd snap_count validation and strscpy conversion, nbd socket lockdep reclassify, virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS email/list updates (Coly, Li, Yu, Christoph Böhmwalder) - Other little fixes and cleanups all over * tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits) MAINTAINERS: Update Coly Li's email address block: check bio split for unaligned bvec nbd: Reclassify sockets to avoid lockdep circular dependency block: add configurable error injection block: add a str_to_blk_op helper block: add a "tag" for block status codes block: add a macro to initialize the status table floppy: Drop unused pnp driver data block: propagate in_flight to whole disk on partition I/O virtio-blk: clamp zone report to the report buffer capacity block: optimize I/O merge hot path with unlikely() hints drivers/block/rbd: Use strscpy() to copy strings into arrays partitions: aix: bound the pp_count scan to the ppe array block: Enable lock context analysis block/mq-deadline: Make the lock context annotations compatible with Clang block/Kyber: Make the lock context annotations compatible with Clang block/blk-mq-debugfs: Improve lock context annotations block/blk-iocost: Inline iocg_lock() and iocg_unlock() block/blk-iocost: Split ioc_rqos_throttle() block/crypto: Annotate the crypto functions ...
13 daysMerge tag 'slab-for-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Support for "allocation tokens" (currently available in Clang 22+) for smarter partitioning of kmalloc caches based on the allocated object type, which can be enabled instead of the "random" per-caller-address-hash partitioning. It should be able to deterministically separate types containing a pointer from those that do not (Marco Elver) - Improvements and simplification of the kmem_cache_alloc_bulk() and mempool_alloc_bulk() API. This includes adaptation of callers (Christoph Hellwig) - Performance improvements and cleanups related mostly to sheaves refill (Hao Li, Shengming Hu, Vlastimil Babka) - Several fixups for the slabinfo tool (Xuewen Wang) * tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: do not limit zeroing to orig_size when only red zoning is enabled mm/slub: preserve original size in _kmalloc_nolock_noprof retry path mm: simplify the mempool_alloc_bulk API mm/slab: improve kmem_cache_alloc_bulk mm/slub: detach and reattach partial slabs in batch mm/slub: introduce helpers for node partial slab state mm/slub: use empty sheaf helpers for oversized sheaves tools/mm/slabinfo: remove redundant slab->partial assignment tools/mm/slabinfo: remove dead assignment in get_obj_and_str() tools/mm/slabinfo: Fix trace disable logic inversion MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR mm/slub: fix typo in sheaves comment mm, slab: simplify returning slab in __refill_objects_node() mm, slab: add an optimistic __slab_try_return_freelist() slab: fix kernel-docs for mm-api slab: improve KMALLOC_PARTITION_RANDOM randomness slab: support for compiler-assisted type-based slab cache partitioning mm/slub: defer freelist construction until after bulk allocation from a new slab
13 daysMerge tag 'arm64-upstream' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "It feels like the new world of AI tooling has slowed us down a little on the feature side when compared to the fixes side. The extra rounds of Sashiko review have also pushed a few things out until next time. Still, there's some good foundational stuff here for the fpsimd code and hardening work towards removing the predictable linear alias of the kernel image. CPU errata handling: - Extend CnP disabling workaround to HiSilicon HIP09 hardware. - Work around eternally broken broadcast TLB invalidation on more CPUs. - Documentation and code cleanups. CPU features: - Add new hwcaps for the 2025 dpISA extensions. Floating point / SVE / SME: - Significant cleanup to the low-level state management code in the core architecture code and KVM. - Use correct register widths during SVE/SME save/restore assembly. - Expose SVE/SME save/restore memory accesses to sanitisers. Memory management: - Preparatory work for unmapping the kernel data and bss sections from the linear map. Miscellaneous: - Inline DAIF manipulation helpers so they can be used safely from non-instrumentable code. - Fix handling of the 'nosmp' cmdline option to avoid marking secondary cores as "possible". MPAM: - Add support for v0.1 of the MPAM architecture. Perf: - Update HiSilicon PMU MAINTAINERS entry. - Fix event encodings for the DVM node in the CMN driver. Selftests: - Extend sigframe tests to cover POE context. - Add coverage for the newly added 2025 dpISA hwcaps. System registers: - Add new registers and ESR encodings for the HDBSS feature. Plus minor fixes and cleanups across the board" * tag 'arm64-upstream' of gitolite.kernel.org:pub/scm/linux/kernel/git/arm64/linux: (73 commits) arm64: errata: Mitigate TLBI errata on Microsoft Azure Cobalt 100 CPU arm64: errata: Mitigate TLBI errata on NVIDIA Olympus CPU arm64: errata: Mitigate TLBI errata on various Arm CPUs arm64: cputype: Add C1-Premium definitions arm64: cputype: Add C1-Ultra definitions Revert "arm64: mm: Unmap kernel data/bss entirely from the linear map" Revert "arm64: mm: Defer remap of linear alias of data/bss" arm64: arch_timer: reuse arch_timer_read_cnt{p,v}ct_el0() helpers arm64/mm: Rename ptdesc_t arm64: mm: Defer remap of linear alias of data/bss KVM: arm64: Omit tag sync on stage-2 mappings of the zero page arm64: Avoid double evaluation of __ptep_get() kasan: Move generic KASAN page tables out of BSS too arm64: Rename page table BSS section to .bss..pgtbl arm64: patching: replace min_t with min in __text_poke perf/arm-cmn: Fix DVM node events arm64: fpsimd: Remove <asm/fpsimdmacros.h> arm64: fpsimd: Move SME save/restore inline arm64: fpsimd: Move sve_flush_live() inline arm64: fpsimd: Move SVE save/restore inline ...
13 daysmm/slab: introduce kmalloc_flags()Vlastimil Babka (SUSE)
With alloc_flags usage in slab, we can replace __GFP_NO_OBJ_EXT with an alloc flag that prevents kmalloc recursion. For that we need a version of kmalloc() that takes alloc_flags and use it in places that perform these potentially recursive kmalloc allocations (of sheaves or obj_ext arrays). Add this function, named kmalloc_flags(). Right now it's only useful for these nested allocations, so it doesn't need to optimize build-time constant sizes like kmalloc() or kmalloc_buckets. Since we need it to support both normal and non-spinning kmalloc_nolock() context through the SLAB_ALLOC_NOLOCK flag, split out most of the special _kmalloc_nolock_noprof() implementation to __kmalloc_nolock_noprof() that takes a slab_alloc_context, and make _kmalloc_nolock_noprof() a simple tail calling wrapper with the proper context. kmalloc_flags() can thus determine whether to call __kmalloc_nolock_noprof() or __do_kmalloc_node(), based on the given alloc_flags. Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-14-7190909db118@kernel.org Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
13 daysmm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock()Vlastimil Babka (SUSE)
The two flags are added internally so there's no point for warning if they are passed by the caller as well, so allow them. This will allow simplifying obj_ext allocation under kmalloc_nolock(). Also it's not necessary to have the extra alloc_gfp variable for adding the two flags. The original gfp_flags parameter is not used anywhere except for the warning. So remove alloc_gfp and directly modify and use gfp_flags everywhere. Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-13-7190909db118@kernel.org Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>