linux.git/mm/vmstat.c, branch v6.9-rc2

mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER

2024-01-08T23:27:15+00:00

commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive.  This has caused
issues with code that was not yet upstream and depended on the previous
definition.

To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.

Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov 
Cc: Linus Torvalds 
Signed-off-by: Andrew Morton

mm, treewide: introduce NR_PAGE_ORDERS

2024-01-08T23:27:15+00:00

NR_PAGE_ORDERS defines the number of page orders supported by the page
allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total.

NR_PAGE_ORDERS assists in defining arrays of page orders and allows for
more natural iteration over them.

[kirill.shutemov@linux.intel.com: fixup for kerneldoc warning]
  Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box
Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Zi Yan 
Cc: Linus Torvalds 
Signed-off-by: Andrew Morton

mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING

2024-01-05T18:17:47+00:00

Demotion can work well without CONFIG_NUMA_BALANCING.  But the commit
23e9f0138963 ("mm/vmstat: move pgdemote_* to per-node stats") wrongly hid
it behind CONFIG_NUMA_BALANCING.

Fix it by moving them out of CONFIG_NUMA_BALANCING.

Link: https://lkml.kernel.org/r/20231229022651.3229174-1-lizhijian@fujitsu.com
Fixes: 23e9f0138963 ("mm/vmstat: move pgdemote_* to per-node stats")
Signed-off-by: Li Zhijian 
Cc: "Huang, Ying" 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Andrew Morton

mm: memcg: add per-memcg zswap writeback stat

2023-12-12T18:57:02+00:00

Since zswap now writes back pages from memcg-specific LRUs, we now need a
new stat to show writebacks count for each memcg.

[nphamcs@gmail.com: rename ZSWP_WB to ZSWPWB]
  Link: https://lkml.kernel.org/r/20231205193307.2432803-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-5-nphamcs@gmail.com
Suggested-by: Nhat Pham 
Signed-off-by: Domenico Cerasuolo 
Signed-off-by: Nhat Pham 
Tested-by: Bagas Sanjaya 
Reviewed-by: Yosry Ahmed 
Cc: Chris Li 
Cc: Dan Streetman 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Muchun Song 
Cc: Roman Gushchin 
Cc: Seth Jennings 
Cc: Shakeel Butt 
Cc: Shuah Khan 
Cc: Vitaly Wool 
Signed-off-by: Andrew Morton

mm/vmstat: move pgdemote_* to per-node stats

2023-12-11T00:51:31+00:00

Demotion will migrate pages across nodes.  Previously, only the global
demotion statistics were accounted for.  Changed them to per-node
statistics, making it easier to observe where demotion occurs on each
node.

This will help to identify which nodes are under pressure.

This patch also make pgdemote_* behind CONFIG_NUMA_BALANCING, since
demotion is not available for !CONFIG_NUMA_BALANCING

With this patch, here is a sample where node0 node1 are DRAM,
node3 is PMEM:
Global stats:
$ grep demote /proc/vmstat
pgdemote_kswapd 254288
pgdemote_direct 113497
pgdemote_khugepaged 0

Per-node stats:
$ grep demote /sys/devices/system/node/node0/vmstat # demotion source
pgdemote_kswapd 68454
pgdemote_direct 83431
pgdemote_khugepaged 0
$ grep demote /sys/devices/system/node/node1/vmstat # demotion source
pgdemote_kswapd 185834
pgdemote_direct 30066
pgdemote_khugepaged 0
$ grep demote /sys/devices/system/node/node3/vmstat # demotion target
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0

Link: https://lkml.kernel.org/r/20231103031450.1456523-1-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian 
Acked-by: "Huang, Ying" 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Signed-off-by: Andrew Morton

mm: tune PCP high automatically

2023-10-25T23:47:10+00:00

The target to tune PCP high automatically is as follows,

- Minimize allocation/freeing from/to shared zone

- Minimize idle pages in PCP

- Minimize pages in PCP if the system free pages is too few

To reach these target, a tuning algorithm as follows is designed,

- When we refill PCP via allocating from the zone, increase PCP high.
  Because if we had larger PCP, we could avoid to allocate from the
  zone.

- In periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
  decrease PCP high to try to free possible idle PCP pages.

- When page reclaiming is active for the zone, stop increasing PCP
  high in allocating path, decrease PCP high and free some pages in
  freeing path.

So, the PCP high can be tuned to the page allocating/freeing depth of
workloads eventually.

One issue of the algorithm is that if the number of pages allocated is
much more than that of pages freed on a CPU, the PCP high may become the
maximal value even if the allocating/freeing depth is small.  But this
isn't a severe issue, because there are no idle pages in this case.

One alternative choice is to increase PCP high when we drain PCP via
trying to free pages to the zone, but don't increase PCP high during PCP
refilling.  This can avoid the issue above.  But if the number of pages
allocated is much less than that of pages freed on a CPU, there will be
many idle pages in PCP and it is hard to free these idle pages.

1/8 (>> 3) of PCP high will be decreased periodically.  The value 1/8 is
kind of arbitrary.  Just to make sure that the idle PCP pages will be
freed eventually.

On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup.  This simulates the
kbuild server that is used by 0-Day kbuild service.  With the patch, the
build time decreases 3.5%.  The cycles% of the spinlock contention (mostly
for zone lock) decreases from 11.0% to 0.5%.  The number of PCP draining
for high order pages freeing (free_high) decreases 65.6%.  The number of
pages allocated from zone (instead of from PCP) decreases 83.9%.

Link: https://lkml.kernel.org/r/20231016053002.756205-8-ying.huang@intel.com
Signed-off-by: "Huang, Ying" 
Suggested-by: Mel Gorman 
Suggested-by: Michal Hocko 
Cc: Vlastimil Babka 
Cc: David Hildenbrand 
Cc: Johannes Weiner 
Cc: Dave Hansen 
Cc: Pavel Tatashin 
Cc: Matthew Wilcox 
Cc: Christoph Lameter 
Cc: Arjan van de Ven 
Cc: Sudeep Holla 
Signed-off-by: Andrew Morton

mm: fix draining remote pageset

2023-10-25T23:47:07+00:00

If there is no memory allocation/freeing in the PCP (Per-CPU Pageset) of a
remote zone (zone in remote NUMA node) after some time (3 seconds for
now), the pages of the PCP of the remote zone will be drained to avoid
memory wastage.

This behavior was introduced in the commit 4ae7c03943fc ("[PATCH]
Periodically drain non local pagesets") and the commit 4037d452202e ("Move
remote node draining out of slab allocators")

But, after the commit 7cc36bbddde5 ("vmstat: on-demand vmstat workers
V8"), the vmstat updater worker which is used to drain the PCP of remote
zones may not be re-queued when we are waiting for the timeout
(pcp->expire != 0) if there are no vmstat changes on this CPU, for
example, when the CPU goes idle or runs user space only workloads.  This
may cause the pages of a remote zone be kept in PCP of this CPU for long
time.  So that, the page reclaiming of the remote zone may be triggered
prematurely.  This isn't a severe problem in practice, because the PCP of
the remote zone will be drained if some memory are allocated/freed again
on this CPU.  And, the PCP will eventually be drained during the direct
reclaiming if necessary.

Anyway, the problem still deserves a fix via guaranteeing that the vmstat
updater worker will always be re-queued when we are waiting for the
timeout.  In effect, this restores the original behavior before the commit
7cc36bbddde5.

We can reproduce the bug via allocating/freeing pages from a remote zone
then go idle as follows.  And the patch can fix it.

- Run some workloads, use `numactl` to bind CPU to node 0 and memory to
  node 1.  So the PCP of the CPU on node 0 for zone on node 1 will be
  filled.

- After workloads finish, idle for 60s

- Check /proc/zoneinfo

With the original kernel, the number of pages in the PCP of the CPU on
node 0 for zone on node 1 is non-zero after idle.  With the patched
kernel, it becomes 0 after idle.  That is, we avoid to keep pages in the
remote PCP during idle.

Link: https://lkml.kernel.org/r/20231007062356.187621-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20230811090819.60845-1-ying.huang@intel.com
Fixes: 7cc36bbddde5 ("vmstat: on-demand vmstat workers V8")
Signed-off-by: "Huang, Ying" 
Reviewed-by: Christoph Lameter 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton

mm/vmstat: use this_cpu_try_cmpxchg in mod_{zone,node}_state

2023-10-04T17:32:20+00:00

Use this_cpu_try_cmpxchg instead of this_cpu_cmpxchg (*ptr, old, new) ==
old in mod_zone_state and mod_node_state.  x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg (and
related move instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails.  There is no need to re-read the value in the loop.

No functional change intended.

Link: https://lkml.kernel.org/r/20230904150917.8318-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak 
Signed-off-by: Andrew Morton

mm/vmstat: remove unused page_ext.h from vmstat

2023-08-21T20:37:30+00:00

No page_ext function or structure is used in vmstat.  Just remove page_ext
header from vmstat.

Link: https://lkml.kernel.org/r/20230717113227.1897173-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi 
Signed-off-by: Andrew Morton

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2023-06-28T17:28:11+00:00

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...