linux-stable.git/mm/vmstat.c, branch v5.13.2

mm: fix typos in comments

2021-05-07T07:26:35+00:00

Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar 
Reviewed-by: Matthew Wilcox (Oracle) 
Reviewed-by: Randy Dunlap 
Cc: Bhaskar Chowdhury 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

x86/mm: track linear mapping split events

2021-05-05T18:27:25+00:00

To help with debugging the sluggishness caused by TLB miss/reload, we
introduce monotonic hugepage [direct mapped] split event counts since
system state: SYSTEM_RUNNING to be displayed as part of /proc/vmstat in
x86 servers

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
  ....
  swap_ra 0
  swap_ra_hit 0
  direct_map_level2_splits 94
  direct_map_level3_splits 4
  nr_unstable 0
  ....

One of the many lasting sources of direct hugepage splits is kernel
tracing (kprobes, tracepoints).

Note that the kernel's code segment [512 MB] points to the same physical
addresses that have been already mapped in the kernel's direct mapping
range.

Source : Documentation/x86/x86_64/mm.rst

When we enable kernel tracing, the kernel has to modify
attributes/permissions of the text segment hugepages that are direct
mapped causing them to split.

Kernel's direct mapped hugepages do not coalesce back after split and
remain in place for the remainder of the lifetime.

An instance of direct page splits when we turn on dynamic kernel tracing
....
cat /proc/vmstat | grep -i direct_map_level
direct_map_level2_splits 784
direct_map_level3_splits 12
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @ [pid, comm] =
count(); }'
cat /proc/vmstat | grep -i
direct_map_level
direct_map_level2_splits 789
direct_map_level3_splits 12
....

Link: https://lkml.kernel.org/r/20210218235744.1040634-1-saravanand@fb.com
Signed-off-by: Saravanan D 
Acked-by: Tejun Heo 
Acked-by: Johannes Weiner 
Acked-by: Dave Hansen 
Cc: Ingo Molnar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: /proc/sys/vm/stat_refresh stop checking monotonic numa stats

2021-05-05T18:27:25+00:00

All of the VM NUMA stats are event counts, incremented never
decremented: it is not very useful for vmstat_refresh() to check them
throughout their first aeon, then warn on them throughout their next.

Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2102251514110.13363@eggly.anvils
Signed-off-by: Hugh Dickins 
Acked-by: Roman Gushchin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: /proc/sys/vm/stat_refresh skip checking known negative stats

2021-05-05T18:27:25+00:00

vmstat_refresh() can occasionally catch nr_zone_write_pending and
nr_writeback when they are transiently negative.  The reason is partly
that the interrupt which decrements them in test_clear_page_writeback()
can come in before __test_set_page_writeback() got to increment them;
but transient negatives are still seen even when that is prevented, and
I am not yet certain why (but see Roman's note below).  Those stats are
not buggy, they have never been seen to drift away from 0 permanently:
so just avoid the annoyance of showing a warning on them.

Similarly avoid showing a warning on nr_free_cma: CMA users have seen
that one reported negative from /proc/sys/vm/stat_refresh too, but it
does drift away permanently: I believe that's because its incrementation
and decrementation are decided by page migratetype, but the migratetype
of a pageblock is not guaranteed to be constant.

Roman Gushchin points out:
 "For performance reasons, vmstat counters are incremented and
  decremented using per-cpu batches. vmstat_refresh() flushes the
  per-cpu batches on all CPUs, to get values as accurate as possible;
  but this method is not atomic, so the resulting value is not always
  precise.

  As a consequence, for those counters whose actual value is close to 0,
  a small negative value may occasionally be reported. If the value is
  small and the state is transient, it is not an indication of an error"

Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-guro@fb.com/
Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2103012158540.7549@eggly.anvils
Signed-off-by: Hugh Dickins 
Reported-by: Roman Gushchin 
Acked-by: Roman Gushchin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: no more EINVAL from /proc/sys/vm/stat_refresh

2021-05-05T18:27:25+00:00

EINVAL was good for drawing the refresher's attention to a warning in
dmesg, but became very tiresome when running test suites scripted with
"set -e": an underflow from a bug in one feature would cause unrelated
tests much later to fail, just because their /proc/sys/vm/stat_refresh
touch failed with that error.  Stop doing that.

Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2102251510410.13363@eggly.anvils
Signed-off-by: Hugh Dickins 
Acked-by: Roman Gushchin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: restore node stat checking in /proc/sys/vm/stat_refresh

2021-05-05T18:27:25+00:00

In v4.7 commit 52b6f46bc163 ("mm: /proc/sys/vm/stat_refresh to force
vmstat update") introduced vmstat_refresh(), with its vmstat underflow
checking; then in v4.8 commit 75ef71840539 ("mm, vmstat: add
infrastructure for per-node vmstats") split NR_VM_NODE_STAT_ITEMS out of
NR_VM_ZONE_STAT_ITEMS without updating vmstat_refresh(): so it has been
missing out much of the vmstat underflow checking ever since.

Reinstate it.

Thanks to Roman Gushchin  for tangentially pointing this out.

Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2102251502240.13363@eggly.anvils
Signed-off-by: Hugh Dickins 
Cc: Roman Gushchin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: vmstat: add cma statistics

2021-05-05T18:27:24+00:00

Since CMA is used more widely, it's worth to have CMA allocation
statistics into vmstat.  With it, we could know how agressively system
uses cma allocation and how often it fails.

Link: https://lkml.kernel.org/r/20210302183346.3707237-1-minchan@kernel.org
Signed-off-by: Minchan Kim 
Reviewed-by: John Hubbard 
Cc: John Dias 
Cc: Suren Baghdasaryan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmstat.c: erase latency in vmstat_shepherd

2021-02-26T17:41:00+00:00

Many 100us+ latencies have been deteceted in vmstat_shepherd() on CPX
platform which has 208 logic cpus.  And vmstat_shepherd is queued every
second, which could make the case worse.

Add schedule point in vmstat_shepherd() to erase the latency.

Link: https://lkml.kernel.org/r/20210111035526.1511-1-benbjiang@tencent.com
Signed-off-by: Jiang Biao 
Reported-by: Bin Lai 
Cc: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: vmstat: add some comments on internal storage of byte items

2021-02-26T17:41:00+00:00

Byte-accounted items are used for slab object accounting at the cgroup
level, because the objects in a slab page can belong to different cgroups.
At the global level these items always change in multiples of whole slab
pages.  The vmstat code exploits this and stores these items as pages
internally, which allows for more compact per-cpu data.

This optimization isn't self-evident from the asserts and the division in
the stat update functions.  Provide the reader with some context.

Link: https://lkml.kernel.org/r/20210202184411.118614-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: vmstat: fix NOHZ wakeups for node stat changes

2021-02-26T17:41:00+00:00

On NOHZ, the periodic vmstat flushers on each CPU can go to sleep and
won't wake up until stat changes are detected in the per-cpu deltas of the
zone vmstat counters.

In commit 75ef71840539 ("mm, vmstat: add infrastructure for per-node
vmstats") per-node counters were introduced, and subsequently most stats
were moved from the zone to the node level.  However, the node counters
weren't added to the NOHZ wakeup detection.

In theory this can cause per-cpu errors to remain in the user-reported
stats indefinitely.  In practice this only affects a handful of sub
counters (file_mapped, dirty and writeback e.g.) because other page state
changes at the node level likely involve a change at the zone level as
well (alloc and free, lru ops).  Also, nobody has complained.

Fix it up for completeness: wake up vmstat refreshing on node changes.
Also remove the BUILD_BUG_ONs that assert counter size; we haven't relied
on it since we added sizeof() to the range calculation in commit
13c9aaf7fa01 ("mm/vmstat.c: fix NUMA statistics updates").

Link: https://lkml.kernel.org/r/20210202184342.118513-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds