<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/mm/memory.c, branch v6.12</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>mm: avoid unconditional one-tick sleep when swapcache_prepare fails</title>
<updated>2024-10-29T04:40:41+00:00</updated>
<author>
<name>Barry Song</name>
<email>v-songbaohua@oppo.com</email>
</author>
<published>2024-09-26T21:19:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=01626a18230246efdcea322aa8f067e60ffe5ccd'/>
<id>01626a18230246efdcea322aa8f067e60ffe5ccd</id>
<content type='text'>
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
introduced an unconditional one-tick sleep when `swapcache_prepare()`
fails, which has led to reports of UI stuttering on latency-sensitive
Android devices.  To address this, we can use a waitqueue to wake up tasks
that fail `swapcache_prepare()` sooner, instead of always sleeping for a
full tick.  While tasks may occasionally be woken by an unrelated
`do_swap_page()`, this method is preferable to two scenarios: rapid
re-entry into page faults, which can cause livelocks, and multiple
millisecond sleeps, which visibly degrade user experience.

Oven's testing shows that a single waitqueue resolves the UI stuttering
issue.  If a 'thundering herd' problem becomes apparent later, a waitqueue
hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]` for page bit
locks can be introduced.

[v-songbaohua@oppo.com: wake_up only when swapcache_wq waitqueue is active]
  Link: https://lkml.kernel.org/r/20241008130807.40833-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240926211936.75373-1-21cnbao@gmail.com
Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Reported-by: Oven Liyang &lt;liyangouwen1@oppo.com&gt;
Tested-by: Oven Liyang &lt;liyangouwen1@oppo.com&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
introduced an unconditional one-tick sleep when `swapcache_prepare()`
fails, which has led to reports of UI stuttering on latency-sensitive
Android devices.  To address this, we can use a waitqueue to wake up tasks
that fail `swapcache_prepare()` sooner, instead of always sleeping for a
full tick.  While tasks may occasionally be woken by an unrelated
`do_swap_page()`, this method is preferable to two scenarios: rapid
re-entry into page faults, which can cause livelocks, and multiple
millisecond sleeps, which visibly degrade user experience.

Oven's testing shows that a single waitqueue resolves the UI stuttering
issue.  If a 'thundering herd' problem becomes apparent later, a waitqueue
hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]` for page bit
locks can be introduced.

[v-songbaohua@oppo.com: wake_up only when swapcache_wq waitqueue is active]
  Link: https://lkml.kernel.org/r/20241008130807.40833-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240926211936.75373-1-21cnbao@gmail.com
Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Reported-by: Oven Liyang &lt;liyangouwen1@oppo.com&gt;
Tested-by: Oven Liyang &lt;liyangouwen1@oppo.com&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Yu Zhao &lt;yuzhao@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: SeongJae Park &lt;sj@kernel.org&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: fix follow_pfnmap API lockdep assert</title>
<updated>2024-10-18T16:50:05+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-10-18T16:50:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b1b46751671be5a426982f037a47ae05f37ff80b'/>
<id>b1b46751671be5a426982f037a47ae05f37ff80b</id>
<content type='text'>
The lockdep asserts for the new follow_pfnmap() API "knows" that a
pfnmap always has a vma-&gt;vm_file, since that's the only way to create
such a mapping.

And that's actually true for all the normal cases.  But not for the mmap
failure case, where the incomplete mapping is torn down and we have
cleared vma-&gt;vm_file because the failure occured before the file was
linked to the vma.

So this codepath does actually need to check for vm_file being NULL.

Reported-by: Jann Horn &lt;jannh@google.com&gt;
Fixes: 6da8e9634bb7 ("mm: new follow_pfnmap API")
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The lockdep asserts for the new follow_pfnmap() API "knows" that a
pfnmap always has a vma-&gt;vm_file, since that's the only way to create
such a mapping.

And that's actually true for all the normal cases.  But not for the mmap
failure case, where the incomplete mapping is torn down and we have
cleared vma-&gt;vm_file because the failure occured before the file was
linked to the vma.

So this codepath does actually need to check for vm_file being NULL.

Reported-by: Jann Horn &lt;jannh@google.com&gt;
Fixes: 6da8e9634bb7 ("mm: new follow_pfnmap API")
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: don't install PMD mappings when THPs are disabled by the hw/process/vma</title>
<updated>2024-10-17T07:28:10+00:00</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2024-10-11T10:24:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2b0f922323ccfa76219bcaacd35cd50aeaa13592'/>
<id>2b0f922323ccfa76219bcaacd35cd50aeaa13592</id>
<content type='text'>
We (or rather, readahead logic :) ) might be allocating a THP in the
pagecache and then try mapping it into a process that explicitly disabled
THP: we might end up installing PMD mappings.

This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
THPs to be PTE-mapped in s390_enable_sie()-&gt;thp_split_mm(), before
starting the VM.

For example, starting a VM backed on a file system with large folios
supported makes the VM crash when the VM tries accessing such a mapping
using KVM.

Is it also a problem when the HW disabled THP using
TRANSPARENT_HUGEPAGE_UNSUPPORTED?  At least on x86 this would be the case
without X86_FEATURE_PSE.

In the future, we might be able to do better on s390x and only disallow
PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
really wants.  For now, fix it by essentially performing the same check as
would be done in __thp_vma_allowable_orders() or in shmem code, where this
works as expected, and disallow PMD mappings, making us fallback to PTE
mappings.

Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Reported-by: Leo Fu &lt;bfu@redhat.com&gt;
Tested-by: Thomas Huth &lt;thuth@redhat.com&gt;
Cc: Thomas Huth &lt;thuth@redhat.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Christian Borntraeger &lt;borntraeger@linux.ibm.com&gt;
Cc: Janosch Frank &lt;frankja@linux.ibm.com&gt;
Cc: Claudio Imbrenda &lt;imbrenda@linux.ibm.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We (or rather, readahead logic :) ) might be allocating a THP in the
pagecache and then try mapping it into a process that explicitly disabled
THP: we might end up installing PMD mappings.

This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
THPs to be PTE-mapped in s390_enable_sie()-&gt;thp_split_mm(), before
starting the VM.

For example, starting a VM backed on a file system with large folios
supported makes the VM crash when the VM tries accessing such a mapping
using KVM.

Is it also a problem when the HW disabled THP using
TRANSPARENT_HUGEPAGE_UNSUPPORTED?  At least on x86 this would be the case
without X86_FEATURE_PSE.

In the future, we might be able to do better on s390x and only disallow
PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
really wants.  For now, fix it by essentially performing the same check as
would be done in __thp_vma_allowable_orders() or in shmem code, where this
works as expected, and disallow PMD mappings, making us fallback to PTE
mappings.

Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Reported-by: Leo Fu &lt;bfu@redhat.com&gt;
Tested-by: Thomas Huth &lt;thuth@redhat.com&gt;
Cc: Thomas Huth &lt;thuth@redhat.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Christian Borntraeger &lt;borntraeger@linux.ibm.com&gt;
Cc: Janosch Frank &lt;frankja@linux.ibm.com&gt;
Cc: Claudio Imbrenda &lt;imbrenda@linux.ibm.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: remove unused stub for can_swapin_thp()</title>
<updated>2024-10-17T07:28:08+00:00</updated>
<author>
<name>Andy Shevchenko</name>
<email>andriy.shevchenko@linux.intel.com</email>
</author>
<published>2024-10-08T19:13:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a5e8eb25135a48d400e5a695ba9329bc632c3bb4'/>
<id>a5e8eb25135a48d400e5a695ba9329bc632c3bb4</id>
<content type='text'>
When can_swapin_thp() is unused, it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:

mm/memory.c:4184:20: error: unused function 'can_swapin_thp' [-Werror,-Wunused-function]

Fix this by removing the unused stub.

See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

Link: https://lkml.kernel.org/r/20241008191329.2332346-1-andriy.shevchenko@linux.intel.com
Fixes: 242d12c98174 ("mm: support large folios swap-in for sync io devices")
Signed-off-by: Andy Shevchenko &lt;andriy.shevchenko@linux.intel.com&gt;
Acked-by: Barry Song &lt;baohua@kernel.org&gt;
Cc: Bill Wendling &lt;morbo@google.com&gt;
Cc: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Cc: Justin Stitt &lt;justinstitt@google.com&gt;
Cc: Nathan Chancellor &lt;nathan@kernel.org&gt;
Cc: Nick Desaulniers &lt;ndesaulniers@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When can_swapin_thp() is unused, it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y:

mm/memory.c:4184:20: error: unused function 'can_swapin_thp' [-Werror,-Wunused-function]

Fix this by removing the unused stub.

See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static
inline functions for W=1 build").

Link: https://lkml.kernel.org/r/20241008191329.2332346-1-andriy.shevchenko@linux.intel.com
Fixes: 242d12c98174 ("mm: support large folios swap-in for sync io devices")
Signed-off-by: Andy Shevchenko &lt;andriy.shevchenko@linux.intel.com&gt;
Acked-by: Barry Song &lt;baohua@kernel.org&gt;
Cc: Bill Wendling &lt;morbo@google.com&gt;
Cc: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Cc: Justin Stitt &lt;justinstitt@google.com&gt;
Cc: Nathan Chancellor &lt;nathan@kernel.org&gt;
Cc: Nick Desaulniers &lt;ndesaulniers@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2024-09-21T14:29:05+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-09-21T14:29:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=617a814f14b8914271f7a70366d72c6196d17663'/>
<id>617a814f14b8914271f7a70366d72c6196d17663</id>
<content type='text'>
Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     mode code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature, it adds new feilds to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     partivularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalizaion and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have notied
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     simgle-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operqtion is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making ot cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memry.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area-&gt;pages[2] into xol_area-&gt;page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     mode code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature, it adds new feilds to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     partivularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalizaion and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have notied
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     simgle-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operqtion is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making ot cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memry.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area-&gt;pages[2] into xol_area-&gt;page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2024-09-19T09:35:31+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-09-19T09:35:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=839c4f596f898edc424070dc8b517381572f8502'/>
<id>839c4f596f898edc424070dc8b517381572f8502</id>
<content type='text'>
Pull misc hotfixes from Andrew Morton:
 "12 hotfixes, 11 of which are cc:stable.

  Four fixes for longstanding ocfs2 issues and the remainder address
  random MM things"

* tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/madvise: process_madvise() drop capability check if same mm
  mm/huge_memory: ensure huge_zero_folio won't have large_rmappable flag set
  mm/hugetlb.c: fix UAF of vma in hugetlb fault pathway
  mm: change vmf_anon_prepare() to __vmf_anon_prepare()
  resource: fix region_intersects() vs add_memory_driver_managed()
  zsmalloc: use unique zsmalloc caches names
  mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu read lock
  mm: vmscan.c: fix OOM on swap stress test
  ocfs2: cancel dqi_sync_work before freeing oinfo
  ocfs2: fix possible null-ptr-deref in ocfs2_set_buffer_uptodate
  ocfs2: remove unreasonable unlock in ocfs2_read_blocks
  ocfs2: fix null-ptr-deref when journal load failed.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull misc hotfixes from Andrew Morton:
 "12 hotfixes, 11 of which are cc:stable.

  Four fixes for longstanding ocfs2 issues and the remainder address
  random MM things"

* tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/madvise: process_madvise() drop capability check if same mm
  mm/huge_memory: ensure huge_zero_folio won't have large_rmappable flag set
  mm/hugetlb.c: fix UAF of vma in hugetlb fault pathway
  mm: change vmf_anon_prepare() to __vmf_anon_prepare()
  resource: fix region_intersects() vs add_memory_driver_managed()
  zsmalloc: use unique zsmalloc caches names
  mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu read lock
  mm: vmscan.c: fix OOM on swap stress test
  ocfs2: cancel dqi_sync_work before freeing oinfo
  ocfs2: fix possible null-ptr-deref in ocfs2_set_buffer_uptodate
  ocfs2: remove unreasonable unlock in ocfs2_read_blocks
  ocfs2: fix null-ptr-deref when journal load failed.
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: support large folios swap-in for sync io devices</title>
<updated>2024-09-17T08:07:01+00:00</updated>
<author>
<name>Chuanhua Han</name>
<email>hanchuanhua@oppo.com</email>
</author>
<published>2024-09-08T23:21:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=242d12c98174584a18965cfab95778893872d650'/>
<id>242d12c98174584a18965cfab95778893872d650</id>
<content type='text'>
Currently, we have mTHP features, but unfortunately, without support for
large folio swap-ins, once these large folios are swapped out, they are
lost because mTHP swap is a one-way process.  The lack of mTHP swap-in
functionality prevents mTHP from being used on devices like Android that
heavily rely on swap.

This patch introduces mTHP swap-in support.  It starts from sync devices
such as zRAM.  This is probably the simplest and most common use case,
benefiting billions of Android phones and similar devices with minimal
implementation cost.  In this straightforward scenario, large folios are
always exclusive, eliminating the need to handle complex rmap and
swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in. Large folios in the buddy system are also
   preserved as much as possible, rather than being fragmented due
   to swap-in.

2. Eliminates fragmentation in swap slots and supports successful
   THP_SWPOUT.

   w/o this patch (Refer to the data from Chris's and Kairui's latest
   swap allocator optimization while running ./thp_swap_allocator_test
   w/o "-a" option [1]):

   ./thp_swap_allocator_test
   Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
   Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
   Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
   Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
   Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
   Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
   Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
   Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
   Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
   Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
   Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
   Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
   Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
   Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
   Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
   Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

   w/ this patch (always 0%):
   Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   ...

3. With both mTHP swap-out and swap-in supported, we offer the option to enable
   zsmalloc compression/decompression with larger granularity[2]. The upcoming
   optimization in zsmalloc will significantly increase swap speed and improve
   compression efficiency. Tested by running 100 iterations of swapping 100MiB
   of anon memory, the swap speed improved dramatically:
                time consumption of swapin(ms)   time consumption of swapout(ms)
     lz4 4k                  45274                    90540
     lz4 64k                 22942                    55667
     zstdn 4k                85035                    186585
     zstdn 64k               46558                    118533

    The compression ratio also improved, as evaluated with 1 GiB of data:
     granularity   orig_data_size   compr_data_size
     4KiB-zstd      1048576000       246876055
     64KiB-zstd     1048576000       199763892

   Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
   realized.

4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
   of nr_pages. Swapping in content filled with the same data 0x11, w/o
   and w/ the patch for five rounds (Since the content is the same,
   decompression will be very fast. This primarily assesses the impact of
   reduced page faults):

  swp in bandwidth(bytes/ms)    w/o              w/
   round1                     624152          1127501
   round2                     631672          1127501
   round3                     620459          1139756
   round4                     606113          1139756
   round5                     624152          1152281
   avg                        621310          1137359      +83%

5. With both mTHP swap-out and swap-in supported, we offer the option to enable
   hardware accelerators(Intel IAA) to do parallel decompression with which
   Kanchana reported 7X improvement on zRAM read latency[3].

[1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
[3] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/

Link: https://lkml.kernel.org/r/20240908232119.2157-4-21cnbao@gmail.com
Signed-off-by: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Co-developed-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Usama Arif &lt;usamaarif642@gmail.com&gt;
Cc: Kanchana P Sridhar &lt;kanchana.p.sridhar@intel.com&gt;
Cc: Kairui Song &lt;ryncsn@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently, we have mTHP features, but unfortunately, without support for
large folio swap-ins, once these large folios are swapped out, they are
lost because mTHP swap is a one-way process.  The lack of mTHP swap-in
functionality prevents mTHP from being used on devices like Android that
heavily rely on swap.

This patch introduces mTHP swap-in support.  It starts from sync devices
such as zRAM.  This is probably the simplest and most common use case,
benefiting billions of Android phones and similar devices with minimal
implementation cost.  In this straightforward scenario, large folios are
always exclusive, eliminating the need to handle complex rmap and
swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in. Large folios in the buddy system are also
   preserved as much as possible, rather than being fragmented due
   to swap-in.

2. Eliminates fragmentation in swap slots and supports successful
   THP_SWPOUT.

   w/o this patch (Refer to the data from Chris's and Kairui's latest
   swap allocator optimization while running ./thp_swap_allocator_test
   w/o "-a" option [1]):

   ./thp_swap_allocator_test
   Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
   Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
   Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
   Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
   Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
   Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
   Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
   Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
   Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
   Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
   Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
   Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
   Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
   Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
   Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
   Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

   w/ this patch (always 0%):
   Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
   Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
   ...

3. With both mTHP swap-out and swap-in supported, we offer the option to enable
   zsmalloc compression/decompression with larger granularity[2]. The upcoming
   optimization in zsmalloc will significantly increase swap speed and improve
   compression efficiency. Tested by running 100 iterations of swapping 100MiB
   of anon memory, the swap speed improved dramatically:
                time consumption of swapin(ms)   time consumption of swapout(ms)
     lz4 4k                  45274                    90540
     lz4 64k                 22942                    55667
     zstdn 4k                85035                    186585
     zstdn 64k               46558                    118533

    The compression ratio also improved, as evaluated with 1 GiB of data:
     granularity   orig_data_size   compr_data_size
     4KiB-zstd      1048576000       246876055
     64KiB-zstd     1048576000       199763892

   Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
   realized.

4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
   of nr_pages. Swapping in content filled with the same data 0x11, w/o
   and w/ the patch for five rounds (Since the content is the same,
   decompression will be very fast. This primarily assesses the impact of
   reduced page faults):

  swp in bandwidth(bytes/ms)    w/o              w/
   round1                     624152          1127501
   round2                     631672          1127501
   round3                     620459          1139756
   round4                     606113          1139756
   round5                     624152          1152281
   avg                        621310          1137359      +83%

5. With both mTHP swap-out and swap-in supported, we offer the option to enable
   hardware accelerators(Intel IAA) to do parallel decompression with which
   Kanchana reported 7X improvement on zRAM read latency[3].

[1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
[3] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/

Link: https://lkml.kernel.org/r/20240908232119.2157-4-21cnbao@gmail.com
Signed-off-by: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Co-developed-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Chris Li &lt;chrisl@kernel.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Usama Arif &lt;usamaarif642@gmail.com&gt;
Cc: Kanchana P Sridhar &lt;kanchana.p.sridhar@intel.com&gt;
Cc: Kairui Song &lt;ryncsn@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios</title>
<updated>2024-09-17T08:07:01+00:00</updated>
<author>
<name>Barry Song</name>
<email>v-songbaohua@oppo.com</email>
</author>
<published>2024-09-08T23:21:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=325efb16da2c840e165d9b620fec8049d4d664cc'/>
<id>325efb16da2c840e165d9b620fec8049d4d664cc</id>
<content type='text'>
With large folios swap-in, we might need to uncharge multiple entries all
together, add nr argument in mem_cgroup_swapin_uncharge_swap().

For the existing two users, just pass nr=1.

Link: https://lkml.kernel.org/r/20240908232119.2157-3-21cnbao@gmail.com
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Reviewed-by: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Kairui Song &lt;ryncsn@gmail.com&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Cc: Kanchana P Sridhar &lt;kanchana.p.sridhar@intel.com&gt;
Cc: Usama Arif &lt;usamaarif642@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With large folios swap-in, we might need to uncharge multiple entries all
together, add nr argument in mem_cgroup_swapin_uncharge_swap().

For the existing two users, just pass nr=1.

Link: https://lkml.kernel.org/r/20240908232119.2157-3-21cnbao@gmail.com
Signed-off-by: Barry Song &lt;v-songbaohua@oppo.com&gt;
Acked-by: Chris Li &lt;chrisl@kernel.org&gt;
Reviewed-by: Yosry Ahmed &lt;yosryahmed@google.com&gt;
Cc: Shakeel Butt &lt;shakeel.butt@linux.dev&gt;
Cc: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Kairui Song &lt;kasong@tencent.com&gt;
Cc: Kairui Song &lt;ryncsn@gmail.com&gt;
Cc: Kalesh Singh &lt;kaleshsingh@google.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Nhat Pham &lt;nphamcs@gmail.com&gt;
Cc: Ryan Roberts &lt;ryan.roberts@arm.com&gt;
Cc: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Chuanhua Han &lt;hanchuanhua@oppo.com&gt;
Cc: Kanchana P Sridhar &lt;kanchana.p.sridhar@intel.com&gt;
Cc: Usama Arif &lt;usamaarif642@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: support poison recovery from copy_present_page()</title>
<updated>2024-09-17T08:07:00+00:00</updated>
<author>
<name>Kefeng Wang</name>
<email>wangkefeng.wang@huawei.com</email>
</author>
<published>2024-09-06T02:42:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=658be46520ce480a44fe405730a1725166298f27'/>
<id>658be46520ce480a44fe405730a1725166298f27</id>
<content type='text'>
Similar to other poison recovery, use copy_mc_user_highpage() to avoid
potentially kernel panic during copy page in copy_present_page() from
fork, once copy failed due to hwpoison in source page, we need to break
out of copy in copy_pte_range() and release prealloc folio, so
copy_mc_user_highpage() is moved ahead before set *prealloc to NULL.

Link: https://lkml.kernel.org/r/20240906024201.1214712-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Reviewed-by: Jane Chu &lt;jane.chu@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jiaqi Yan &lt;jiaqiyan@google.com&gt;
Cc: Naoya Horiguchi &lt;nao.horiguchi@gmail.com&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Similar to other poison recovery, use copy_mc_user_highpage() to avoid
potentially kernel panic during copy page in copy_present_page() from
fork, once copy failed due to hwpoison in source page, we need to break
out of copy in copy_pte_range() and release prealloc folio, so
copy_mc_user_highpage() is moved ahead before set *prealloc to NULL.

Link: https://lkml.kernel.org/r/20240906024201.1214712-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Reviewed-by: Jane Chu &lt;jane.chu@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jiaqi Yan &lt;jiaqiyan@google.com&gt;
Cc: Naoya Horiguchi &lt;nao.horiguchi@gmail.com&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: support poison recovery from do_cow_fault()</title>
<updated>2024-09-17T08:07:00+00:00</updated>
<author>
<name>Kefeng Wang</name>
<email>wangkefeng.wang@huawei.com</email>
</author>
<published>2024-09-06T02:42:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=aa549f923f5e037a459dcd588932db9abfa8c158'/>
<id>aa549f923f5e037a459dcd588932db9abfa8c158</id>
<content type='text'>
Patch series "mm: hwpoison: two more poison recovery".

One more CoW path to support poison recorvery in do_cow_fault(), and the
last copy_user_highpage() user is replaced to copy_mc_user_highpage() from
copy_present_page() during fork to support poison recorvery too.


This patch (of 2):

Like commit a873dfe1032a ("mm, hwpoison: try to recover from copy-on
write faults"), there is another path which could crash because it does
not have recovery code where poison is consumed by the kernel in
do_cow_fault(), a crash calltrace shown below on old kernel, but it
could be happened in the lastest mainline code,

  CPU: 7 PID: 3248 Comm: mpi Kdump: loaded Tainted: G           OE     5.10.0 #1
  pc : copy_page+0xc/0xbc
  lr : copy_user_highpage+0x50/0x9c
  Call trace:
    copy_page+0xc/0xbc
    do_cow_fault+0x118/0x2bc
    do_fault+0x40/0x1a4
    handle_pte_fault+0x154/0x230
    __handle_mm_fault+0x1a8/0x38c
    handle_mm_fault+0xf0/0x250
    do_page_fault+0x184/0x454
    do_translation_fault+0xac/0xd4
    do_mem_abort+0x44/0xbc

Fix it by using copy_mc_user_highpage() to handle this case and return
VM_FAULT_HWPOISON for cow fault.

[wangkefeng.wang@huawei.com: unlock/put vmf-&gt;page, per Miaohe]
  Link: https://lkml.kernel.org/r/20240910021541.234300-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240906024201.1214712-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240906024201.1214712-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Reviewed-by: Jane Chu &lt;jane.chu@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jiaqi Yan &lt;jiaqiyan@google.com&gt;
Cc: Naoya Horiguchi &lt;nao.horiguchi@gmail.com&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Patch series "mm: hwpoison: two more poison recovery".

One more CoW path to support poison recorvery in do_cow_fault(), and the
last copy_user_highpage() user is replaced to copy_mc_user_highpage() from
copy_present_page() during fork to support poison recorvery too.


This patch (of 2):

Like commit a873dfe1032a ("mm, hwpoison: try to recover from copy-on
write faults"), there is another path which could crash because it does
not have recovery code where poison is consumed by the kernel in
do_cow_fault(), a crash calltrace shown below on old kernel, but it
could be happened in the lastest mainline code,

  CPU: 7 PID: 3248 Comm: mpi Kdump: loaded Tainted: G           OE     5.10.0 #1
  pc : copy_page+0xc/0xbc
  lr : copy_user_highpage+0x50/0x9c
  Call trace:
    copy_page+0xc/0xbc
    do_cow_fault+0x118/0x2bc
    do_fault+0x40/0x1a4
    handle_pte_fault+0x154/0x230
    __handle_mm_fault+0x1a8/0x38c
    handle_mm_fault+0xf0/0x250
    do_page_fault+0x184/0x454
    do_translation_fault+0xac/0xd4
    do_mem_abort+0x44/0xbc

Fix it by using copy_mc_user_highpage() to handle this case and return
VM_FAULT_HWPOISON for cow fault.

[wangkefeng.wang@huawei.com: unlock/put vmf-&gt;page, per Miaohe]
  Link: https://lkml.kernel.org/r/20240910021541.234300-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240906024201.1214712-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240906024201.1214712-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang &lt;wangkefeng.wang@huawei.com&gt;
Reviewed-by: Jane Chu &lt;jane.chu@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Jiaqi Yan &lt;jiaqiyan@google.com&gt;
Cc: Naoya Horiguchi &lt;nao.horiguchi@gmail.com&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
