linux.git/mm/memcontrol.c, branch v6.0

mm: memcontrol: fix potential oom_lock recursion deadlock

2022-07-30T01:07:18+00:00

syzbot is reporting a GFP_KERNEL allocation made with oom_lock held when
reporting memcg OOM [1].  If this allocation triggers a global OOM
situation, the system can livelock: the GFP_KERNEL allocation with
oom_lock held cannot invoke the global OOM killer because
__alloc_pages_may_oom() fails to acquire oom_lock.

Fix this problem by removing the allocation from memory_stat_format()
completely, and passing a static buffer when calling it from the memcg
OOM path.
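
The idea can be illustrated with a minimal userspace sketch.  It is not
the kernel code: demo_stats, demo_stat_format() and demo_oom_report() are
hypothetical names, and the static buffer is assumed to be serialized by
the caller (in the kernel that role would fall to oom_lock).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the counters a memcg would report. */
struct demo_stats {
	unsigned long anon;
	unsigned long file;
};

/*
 * Format the stats into a caller-supplied buffer instead of allocating
 * one.  Output is truncated if the buffer is too small.  Returns the
 * length of the string actually stored in buf.
 */
static size_t demo_stat_format(const struct demo_stats *s, char *buf,
			       size_t bufsize)
{
	assert(buf != NULL && bufsize > 0);

	if (snprintf(buf, bufsize, "anon %lu\nfile %lu\n",
		     s->anon, s->file) < 0)
		buf[0] = '\0';
	return strlen(buf);
}

/*
 * OOM-style reporting path: the buffer is static, so no allocation can
 * happen here.  Callers are assumed to be serialized by a lock.
 */
static const char *demo_oom_report(const struct demo_stats *s)
{
	static char buf[128];

	demo_stat_format(s, buf, sizeof(buf));
	return buf;
}
```

A path that may sleep and allocate can still pass its own heap buffer to
demo_stat_format(); only the reporting path needs the static one.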

Note that the caller holding a filesystem lock was the trigger for syzbot
to report this locking dependency.  Doing a GFP_KERNEL allocation with a
filesystem lock held can deadlock the system even without involving an
OOM situation.

Link: https://syzkaller.appspot.com/bug?extid=2d2aeadc6ce1e1f11d45 [1]
Link: https://lkml.kernel.org/r/86afb39f-8c65-bec2-6cfc-c5e3cd600c0b@I-love.SAKURA.ne.jp
Fixes: c8713d0b23123759 ("mm: memcontrol: dump memory.stat during cgroup OOM")
Signed-off-by: Tetsuo Handa 
Reported-by: syzbot 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Signed-off-by: Andrew Morton

mm/memcontrol.c: remove the redundant updating of stats_flush_threshold

2022-07-30T01:07:17+00:00

Remove the redundant updating of stats_flush_threshold.  Once the global
variable stats_flush_threshold has exceeded the trigger value for
__mem_cgroup_flush_stats(), incrementing it further is unnecessary.
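
A minimal userspace sketch of the pattern, using C11 atomics.  The names
and the DEMO_TRIGGER value are hypothetical and are not taken from the
kernel; the point is only that a cheap read replaces a contended atomic
write once the counter is already past the trigger.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_TRIGGER 4	/* hypothetical flush trigger value */

/* Shared counter that many CPUs would otherwise keep incrementing. */
static atomic_int demo_flush_threshold;

/*
 * Record 'batches' worth of pending updates.  Returns 1 if the shared
 * counter was written, 0 if the write was skipped because a flush is
 * already due.
 */
static int demo_note_updates(int batches)
{
	assert(batches > 0);

	if (atomic_load(&demo_flush_threshold) > DEMO_TRIGGER)
		return 0;	/* already past the trigger: skip the write */

	atomic_fetch_add(&demo_flush_threshold, batches);
	return 1;
}

/* A flush is due once the counter has exceeded the trigger. */
static int demo_should_flush(void)
{
	return atomic_load(&demo_flush_threshold) > DEMO_TRIGGER;
}
```

The check and the add are not one atomic step, so the counter can still
overshoot slightly under concurrency; that is harmless here because only
"past the trigger or not" matters.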

Tested with the patch applied, using pts/hackbench-1.0.0 Count:4 (160
threads).

Score gain: 1.95x
Reduce CPU cycles in __mod_memcg_lruvec_state (44.88% -> 0.12%)

CPU: ICX 8380 x 2 sockets
Core number: 40 x 2 physical cores
Benchmark: pts/hackbench-1.0.0 Count:4 (160 threads)

Link: https://lkml.kernel.org/r/20220722164949.47760-1-jiebin.sun@intel.com
Signed-off-by: Jiebin Sun 
Acked-by: Shakeel Butt 
Reviewed-by: Roman Gushchin 
Reviewed-by: Tim Chen 
Acked-by: Muchun Song 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: "Huang, Ying" 
Cc: Amadeusz Sławiński 
Signed-off-by: Andrew Morton

mm: vmpressure: don't count proactive reclaim in vmpressure

2022-07-30T01:07:15+00:00

memory.reclaim is a cgroup v2 interface that allows users to proactively
reclaim memory from a memcg, without real memory pressure.  Reclaim
operations invoke vmpressure, which is used: (a) To notify userspace of
reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being
under memory pressure for networking (see
mem_cgroup_under_socket_pressure()).

For (a), vmpressure notifications in v1 are not affected by this change
since memory.reclaim is a v2 feature.

For (b), the effects of the vmpressure signal (according to Shakeel [1])
are as follows:
1. Reduces the send and receive buffers of the current socket.
2. May drop packets on the rx path.
3. May throttle the current thread on the tx path.

Since proactive reclaim is invoked directly by userspace, not by memory
pressure, it makes sense not to throttle networking.  Hence, this change
makes sure that proactive reclaim caused by memory.reclaim does not
trigger vmpressure.
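
As a rough illustration only, the behaviour amounts to a flag on the
reclaim request that suppresses the pressure report.  The sketch below is
userspace C with hypothetical names (demo_scan_control, demo_vmpressure(),
demo_reclaim()); it does not reproduce the kernel's reclaim code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reclaim request descriptor. */
struct demo_scan_control {
	bool proactive;	/* reclaim was requested by userspace */
};

/* Counts how often the pressure signal was raised. */
static unsigned int demo_pressure_reports;

static void demo_vmpressure(void)
{
	demo_pressure_reports++;
}

/* Reclaim pass: only reclaim caused by real pressure reports it. */
static void demo_reclaim(const struct demo_scan_control *sc)
{
	assert(sc != NULL);

	/* ... the actual reclaim work would happen here ... */

	if (!sc->proactive)
		demo_vmpressure();
}
```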

[1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/

[yosryahmed@google.com: update documentation]
  Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed 
Acked-by: Shakeel Butt 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Johannes Weiner 
Cc: Roman Gushchin 
Cc: Muchun Song 
Cc: Matthew Wilcox 
Cc: Vlastimil Babka 
Cc: David Hildenbrand 
Cc: Miaohe Lin 
Cc: NeilBrown 
Cc: Alistair Popple 
Cc: Suren Baghdasaryan 
Cc: Peter Xu 
Signed-off-by: Andrew Morton

mm: memcontrol: do not miss MEMCG_MAX events for enforced allocations

2022-07-30T01:07:14+00:00

Yafang Shao reported an issue related to the accounting of bpf memory:
if a bpf map is charged indirectly for memory consumed from an
interrupt context and allocations are enforced, MEMCG_MAX events are
not raised.

This is less of an issue in the generic case because subsequent
allocations from a process context will trigger direct reclaim and
MEMCG_MAX events will be raised.  However, a bpf map can belong to a
dying/abandoned memory cgroup, so there will be no allocations from a
process context and no MEMCG_MAX events will be triggered.
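
The intended behaviour can be sketched in userspace C as follows.  This is
a simplified model under stated assumptions, not the kernel charge path:
demo_cgroup and demo_charge() are hypothetical, reclaim is assumed to free
nothing, and "enforce" stands for any allocation that must be charged even
over the limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified cgroup: usage and limit in pages. */
struct demo_cgroup {
	unsigned long usage;
	unsigned long max;
	unsigned long max_events;	/* number of "max" events raised */
};

/*
 * Charge 'pages' to cg.  may_reclaim models a process context where
 * direct reclaim is allowed; enforce models a charge that must succeed
 * even above the limit.  Returns true if the charge was applied.
 */
static bool demo_charge(struct demo_cgroup *cg, unsigned long pages,
			bool may_reclaim, bool enforce)
{
	bool raised_max_event = false;

	assert(cg != NULL);

	if (cg->usage + pages <= cg->max) {
		cg->usage += pages;
		return true;
	}

	if (may_reclaim) {
		/* The event is raised on the way into reclaim. */
		cg->max_events++;
		raised_max_event = true;
		/* The sketch assumes reclaim frees nothing. */
	}

	if (!enforce)
		return false;

	/* Enforced charge: do not lose the event if none was raised. */
	if (!raised_max_event)
		cg->max_events++;
	cg->usage += pages;
	return true;
}
```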

Link: https://lkml.kernel.org/r/20220702033521.64630-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin 
Reported-by: Yafang Shao 
Acked-by: Shakeel Butt 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Muchun Song 
Signed-off-by: Andrew Morton

mm/memcontrol.c: replace cgroup_memory_nokmem with mem_cgroup_kmem_disabled()

2022-07-18T00:14:36+00:00

mem_cgroup_kmem_disabled() checks whether kmem accounting is off.
Therefore, replace cgroup_memory_nokmem with mem_cgroup_kmem_disabled(),
matching what percpu.c and slab_common.c already do.

Link: https://lkml.kernel.org/r/20220625061844.226764-1-xiangyang3@huawei.com
Signed-off-by: Xiang Yang 
Reviewed-by: Muchun Song 
Acked-by: Roman Gushchin 
Acked-by: Souptick Joarder (HPE) 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Shakeel Butt 
Signed-off-by: Andrew Morton

mm: add zone device coherent type memory support

2022-07-18T00:14:27+00:00

Add support for device memory that is cache coherent from both the device
and CPU points of view.  This is used on platforms that have an advanced
system bus (like CAPI or CXL).  Any page of a process can be migrated to
such memory.  However, no one should be allowed to pin such memory, so
that it can always be evicted.

[hch@lst.de: rebased on top of the refcount changes, remove is_dev_private_or_coherent_page]
Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com
Signed-off-by: Alex Sierra 
Signed-off-by: Christoph Hellwig 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
Acked-by: David Hildenbrand 
Cc: Jason Gunthorpe 
Cc: Jerome Glisse 
Cc: Matthew Wilcox 
Cc: Ralph Campbell 
Signed-off-by: Andrew Morton

mm: memcontrol: introduce mem_cgroup_ino() and mem_cgroup_get_from_ino()

2022-07-04T01:08:40+00:00

Patch series "mm: introduce shrinker debugfs interface", v5.

The only existing debugging mechanism is a couple of tracepoints in
do_shrink_slab(): mm_shrink_slab_start and mm_shrink_slab_end.  They do
not cover everything, though: shrinkers which report 0 objects will never
show up, and there is no support for memcg-aware shrinkers.  Shrinkers
are identified by their scan function, which is not always enough (e.g.
it is hard to tell which super block's shrinker it is from
"super_cache_scan" alone).

To provide better visibility and debug options for memory shrinkers, this
patchset introduces a /sys/kernel/debug/shrinker interface, to some extent
similar to /sys/kernel/slab.

For each shrinker registered in the system a directory is created.  For
now, the directory will contain only a "scan" file, which reports the
number of managed objects for each memory cgroup (for memcg-aware
shrinkers) and each numa node (for numa-aware shrinkers on a numa
machine).  Other interfaces might be added in the future.

To make debugging more pleasant, the patchset also names all shrinkers, so
that debugfs entries can have meaningful names.


This patch (of 5):

Shrinker debugfs requires a way to represent memory cgroups without using
full paths, both for displaying information and getting input from a user.

The cgroup inode number is a good fit for this, and is already used by bpf.

This commit adds a couple of helper functions which will be used to handle
memcg-aware shrinkers.
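
A userspace sketch of what such a pair of helpers looks like in spirit.
Everything here is hypothetical (demo_memcg, the fixed table, the plain
integer refcount); the kernel helpers work on real cgroup objects and
their own lookup and reference-counting machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memcg with an inode number and a reference count. */
struct demo_memcg {
	unsigned long ino;
	int refcount;
};

/* Hypothetical registry standing in for the cgroup hierarchy. */
static struct demo_memcg demo_memcgs[] = {
	{ .ino = 11, .refcount = 1 },
	{ .ino = 42, .refcount = 1 },
};

/* Short identifier for display, instead of a full cgroup path. */
static unsigned long demo_memcg_ino(const struct demo_memcg *memcg)
{
	assert(memcg != NULL);
	return memcg->ino;
}

/*
 * Resolve user input back to a memcg.  On success a reference is taken
 * and must be dropped by the caller; returns NULL if nothing matches.
 */
static struct demo_memcg *demo_memcg_get_from_ino(unsigned long ino)
{
	size_t i;

	for (i = 0; i < sizeof(demo_memcgs) / sizeof(demo_memcgs[0]); i++) {
		if (demo_memcgs[i].ino == ino) {
			demo_memcgs[i].refcount++;
			return &demo_memcgs[i];
		}
	}
	return NULL;
}
```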

Link: https://lkml.kernel.org/r/20220601032227.4076670-1-roman.gushchin@linux.dev
Link: https://lkml.kernel.org/r/20220601032227.4076670-2-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin 
Acked-by: Muchun Song 
Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Hillf Danton 
Cc: Christophe JAILLET 
Cc: Roman Gushchin 
Signed-off-by: Andrew Morton

Merge branch 'master' into mm-stable

2022-06-27T17:31:34+00:00

mm: kmem: make mem_cgroup_from_obj() vmalloc()-safe

2022-06-17T02:48:31+00:00

Currently mem_cgroup_from_obj() does not work properly with objects
allocated using vmalloc().  This creates problems when it is called for
static objects belonging to modules or, more generally, for objects
allocated using vmalloc().

This patch makes mem_cgroup_from_obj() safe to be called on objects
allocated using vmalloc().

It also introduces mem_cgroup_from_slab_obj(), which is a faster version
to use in places where we know the object is either a slab object or a
generic slab page (e.g.  when adding an object to a lru list).

Link: https://lkml.kernel.org/r/20220610180310.1725111-1-roman.gushchin@linux.dev
Suggested-by: Kefeng Wang 
Signed-off-by: Roman Gushchin 
Tested-by: Linux Kernel Functional Testing 
Acked-by: Shakeel Butt 
Tested-by: Vasily Averin 
Acked-by: Michal Hocko 
Acked-by: Muchun Song 
Cc: Johannes Weiner 
Cc: Naresh Kamboju 
Cc: Qian Cai 
Cc: Kefeng Wang 
Cc: David S. Miller 
Cc: Eric Dumazet 
Cc: Florian Westphal 
Cc: Jakub Kicinski 
Cc: Michal Koutný 
Cc: Paolo Abeni 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

mm: memcontrol: add {pgscan,pgsteal}_{kswapd,direct} items in memory.stat of cgroup v2

2022-06-17T02:48:29+00:00

The memcg events already track {pgscan,pgsteal}_kswapd and
{pgscan,pgsteal}_direct, but currently only the sum of the two is
displayed in memory.stat of cgroup v2.

In order to obtain more accurate information during monitoring and
debugging, and to align with the display in /proc/vmstat, it is better to
display {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct separately.

Also, for backward compatibility, we still display the pgscan and pgsteal
items so that existing applications are not broken.
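
The resulting layout can be sketched as below for the pgscan counters
(pgsteal would follow the same pattern).  This is an illustrative
userspace example with hypothetical names; the ordering of the lines in
the real memory.stat output is not implied by it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-memcg scan counters. */
struct demo_events {
	unsigned long pgscan_kswapd;
	unsigned long pgscan_direct;
};

/* The legacy "pgscan" item stays the sum of the two sources. */
static unsigned long demo_pgscan_total(const struct demo_events *e)
{
	return e->pgscan_kswapd + e->pgscan_direct;
}

/*
 * Emit the combined item plus the two per-source items.  Returns the
 * value of snprintf(): the number of characters the full output needs.
 */
static int demo_format_pgscan(const struct demo_events *e, char *buf,
			      size_t size)
{
	assert(buf != NULL && size > 0);

	return snprintf(buf, size,
			"pgscan %lu\npgscan_kswapd %lu\npgscan_direct %lu\n",
			demo_pgscan_total(e), e->pgscan_kswapd,
			e->pgscan_direct);
}
```

A parser that only knows the "pgscan" key keeps working, while newer
tooling can read the per-source lines.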

[zhengqi.arch@bytedance.com: add comment for memcg_vm_event_stat (suggested by Michal)]
  Link: https://lkml.kernel.org/r/20220606154028.55030-1-zhengqi.arch@bytedance.com
[zhengqi.arch@bytedance.com: fix the doc, thanks to Johannes]
  Link: https://lkml.kernel.org/r/20220607064803.79363-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20220604082209.55174-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng 
Acked-by: Johannes Weiner 
Acked-by: Roman Gushchin 
Acked-by: Muchun Song 
Acked-by: Shakeel Butt 
Acked-by: Michal Hocko 
Cc: Muchun Song 
Cc: Jonathan Corbet 
Signed-off-by: Andrew Morton