linux-stable.git/mm/vmstat.c, branch v4.9.321

mm, vmstat: drop zone->lock in /proc/pagetypeinfo

2021-06-03T06:23:27+00:00

Commit 93b3a674485f6a4b8ffff85d1682d5e8b7c51560 upstream

Commit 93b3a674485f ("mm,vmstat: reduce zone->lock holding time by
/proc/pagetypeinfo") upstream caps the number of iterations over each
free_list at 100,000, and also drops the zone->lock in between each
migrate type. Capping the iteration count alters the file contents in
some cases, which means this approach may not be suitable for stable
backports.

However, dropping zone->lock in between migrate types (and, as a result,
page orders) will not change the /proc/pagetypeinfo file contents. It
can significantly reduce the length of time spent with IRQs disabled,
which can prevent missed interrupts or soft lockups which we have
observed on systems with particularly large memory.

Thus, this commit is a modified version of the upstream one which only
drops the lock in between migrate types.

Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Stephen Brennan 
Reviewed-by: Aruna Ramakrishna 
Reviewed-by: Khalid Aziz 
Signed-off-by: Greg Kroah-Hartman

mm, vmstat: hide /proc/pagetypeinfo from normal users

2019-11-12T18:15:43+00:00

commit abaed0112c1db08be15a784a2c5c8a8b3063cdd3 upstream.

/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state wrt to fragmentation.  It is not very useful for any
other use so normal users really do not need to read this file.

Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that a)
interferes with the page allocator and its users and b) can lead to hard
lockups on large machines which have very long free_list.

Reduce both issues by simply not exporting the file to regular users.

Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: Michal Hocko 
Reported-by: Waiman Long 
Acked-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Acked-by: Waiman Long 
Acked-by: Rafael Aquini 
Acked-by: David Rientjes 
Reviewed-by: Andrew Morton 
Cc: David Hildenbrand 
Cc: Johannes Weiner 
Cc: Roman Gushchin 
Cc: Konstantin Khlebnikov 
Cc: Jann Horn 
Cc: Song Liu 
Cc: Greg Kroah-Hartman 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n

2019-04-27T07:34:46+00:00

commit e8277b3b52240ec1caad8e6df278863e4bf42eac upstream.

Commit 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
depends on skipping vmstat entries with empty name introduced in
7aaf77272358 ("mm: don't show nr_indirectly_reclaimable in
/proc/vmstat") but reverted in b29940c1abd7 ("mm: rename and change
semantics of nr_indirectly_reclaimable_bytes").

So skipping no longer works and /proc/vmstat has misformatted lines " 0".

This patch simply shows debug counters "nr_tlb_remote_*" for UP.

Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz
Fixes: 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
Signed-off-by: Konstantin Khlebnikov 
Acked-by: Vlastimil Babka 
Cc: Roman Gushchin 
Cc: Jann Horn 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/vmstat.c: fix outdated vmstat_text

2018-10-18T07:13:21+00:00

commit 28e2c4bb99aa40f9d5f07ac130cbc4da0ea93079 upstream.

7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the
VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
entry in vmstat_text.  This causes an out-of-bounds access in
vmstat_show().

Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
is probably very rare.

Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely")
Signed-off-by: Jann Horn 
Reviewed-by: Kees Cook 
Reviewed-by: Andrew Morton 
Acked-by: Michal Hocko 
Acked-by: Roman Gushchin 
Cc: Davidlohr Bueso 
Cc: Oleg Nesterov 
Cc: Christoph Lameter 
Cc: Kemi Wang 
Cc: Andy Lutomirski 
Cc: Ingo Molnar 
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly

2018-10-13T07:18:53+00:00

commit 58bc4c34d249bf1bc50730a9a209139347cfacfe upstream.

5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
the kernel unconditional to reduce #ifdef soup, but (either to avoid
showing dummy zero counters to userspace, or because that code was missed)
didn't update the vmstat_array, meaning that all following counters would
be shown with incorrect values.

This only affects kernel builds with
CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
Signed-off-by: Jann Horn 
Reviewed-by: Kees Cook 
Reviewed-by: Andrew Morton 
Acked-by: Michal Hocko 
Acked-by: Roman Gushchin 
Cc: Davidlohr Bueso 
Cc: Oleg Nesterov 
Cc: Christoph Lameter 
Cc: Kemi Wang 
Cc: Andy Lutomirski 
Cc: Ingo Molnar 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

mm, vmstat: Remove spurious WARN() during zoneinfo print

2018-04-13T17:47:55+00:00

[ Upstream commit 8d35bb310698c69d73073b26fc581f2e3f7f621d ]

After commit e2ecc8a79ed4 ("mm, vmstat: print non-populated zones in
zoneinfo"), /proc/zoneinfo will show unpopulated zones.

A memoryless node, having no populated zones at all, was previously
ignored, but will now trigger the WARN() in is_zone_first_populated().

Remove this warning, as its only purpose was to warn of a situation that
has since been enabled.

Aside: The "per-node stats" are still printed under the first populated
zone, but that's not necessarily the first stanza any more.  I'm not
sure which criteria is more important with regard to not breaking
parsers, but it looks a little weird to the eye.

Fixes:  e2ecc8a79ed4 ("mm, vmstat: print node-based stats in zoneinfo file")
Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab 
Cc: David Rientjes 
Cc: Anshuman Khandual 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm, vmstat: suppress pcp stats for unpopulated zones in zoneinfo

2018-03-24T10:00:21+00:00

[ Upstream commit 7dfb8bf3b9caef4049bee51d2c22e1c3a311d483 ]

After "mm, vmstat: print non-populated zones in zoneinfo",
/proc/zoneinfo will show unpopulated zones.

The per-cpu pageset statistics are not relevant for unpopulated zones
and can be potentially lengthy, so supress them when they are not
interesting.

Also moves lowmem reserve protection information above pcp stats since
it is relevant for all zones per vm.lowmem_reserve_ratio.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
Signed-off-by: David Rientjes 
Cc: Anshuman Khandual 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2018-01-31T11:55:53+00:00

commit c73322d098e4b6f5f0f0fa1330bf57e218775539 upstream.

Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".

Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage.  We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages.  In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met.  Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.

This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer.  If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.

If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.

This patch (of 9):

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance.  Up until
commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism.  It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them.  This guard was erroneously
removed as the patch changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met.  Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off.  If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node.  This is the same number of shots
direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.

[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
  Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
  Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner 
Signed-off-by: Shakeel Butt 
Reported-by: Jia He 
Tested-by: Jia He 
Acked-by: Michal Hocko 
Acked-by: Hillf Danton 
Acked-by: Minchan Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Dmitry Shmidt 
Signed-off-by: Greg Kroah-Hartman

kaiser: vmstat show NR_KAISERTABLE as nr_overhead

2018-01-05T14:46:33+00:00

The kaiser update made an interesting choice, never to free any shadow
page tables.  Contention on global spinlock was worrying, particularly
with it held across page table scans when freeing.  Something had to be
done: I was going to add refcounting; but simply never to free them is
an appealing choice, minimizing contention without complicating the code
(the more a page table is found already, the less the spinlock is used).

But leaking pages in this way is also a worry: can we get away with it?
At the very least, we need a count to show how bad it actually gets:
in principle, one might end up wasting about 1/256 of memory that way
(1/512 for when direct-mapped pages have to be user-mapped, plus 1/512
for when they are user-mapped from the vmalloc area on another occasion
(but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed).

Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the
shared pgd entries, and 1 for each intermediate page table added
thereafter for user-mapping - but leave out the 1 per mm, for its
shadow pgd, because that distracts from the monotonic increase.
Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled).

In practice, it doesn't look so bad so far: more like 1/12000 after
nine hours of gtests below; and movable pageblock segregation should
tend to cluster the kaiser tables into a subset of the address space
(if not, they will be bad for compaction too).  But production may
tell a different story: keep an eye on this number, and bring back
lighter freeing if it gets out of control (maybe a shrinker).

["nr_overhead" should of course say "nr_kaisertable", if it needs
to stay; but for the moment we are being coy, preferring that when
Joe Blow notices a new line in his /proc/vmstat, he does not get
too curious about what this "kaiser" stuff might be.]

Signed-off-by: Hugh Dickins 
Signed-off-by: Greg Kroah-Hartman

seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char

2016-10-08T01:46:30+00:00

Allow some seq_puts removals by taking a string instead of a single
char.

[akpm@linux-foundation.org: update vmstat_show(), per Joe]
Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
Signed-off-by: Joe Perches 
Cc: Joe Perches 
Cc: Andi Kleen 
Cc: Al Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds