linux.git/mm/memcontrol.c, branch v4.19

mm: memcontrol: print proper OOM header when no eligible victim left

2018-09-04T23:45:02+00:00

When the memcg OOM killer runs out of killable tasks, it currently
prints a WARN with no further OOM context.  This has caused some user
confusion.

Warnings indicate a kernel problem.  In a reported case, however, the
situation was triggered by a nonsensical memcg configuration (hard limit
set to 0).  But without any VM context this wasn't obvious from the
report, and it took some back and forth on the mailing list to identify
what is actually a trivial issue.

Handle this OOM condition like we handle it in the global OOM killer:
dump the full OOM context and tell the user we ran out of tasks.

This way the user can identify misconfigurations easily by themselves
and rectify the problem - without having to go through the hassle of
running into an obscure but unsettling warning, finding the appropriate
kernel mailing list and waiting for a kernel developer to remote-analyze
that the memcg configuration caused this.

If users cannot make sense of why the OOM killer was triggered or why it
failed, they will still report it to the mailing list; we know that from
experience.  So in case there is an actual kernel bug causing this,
kernel developers will very likely hear about it.

Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Dmitry Vyukov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: introduce memory.oom.group

2018-08-22T17:52:45+00:00

For some workloads an intervention from the OOM killer can be painful.
Killing a random task can bring the workload into an inconsistent state.

Historically, there are two common solutions for this
problem:
1) enabling panic_on_oom,
2) using a userspace daemon to monitor OOMs and kill
   all outstanding processes.

Both approaches have their downsides: rebooting on each OOM is an obvious
waste of capacity, and handling everything in userspace is tricky and
requires a userspace agent that monitors all cgroups for OOMs.

In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
the necessity of enabling panic_on_oom.  Also, it can simplify the cgroup
management for userspace applications.

This commit introduces a new knob for cgroup v2 memory controller:
memory.oom.group.  The knob determines whether the cgroup should be
treated as an indivisible workload by the OOM killer.  If set, all tasks
belonging to the cgroup or to its descendants (if the memory cgroup is not
a leaf cgroup) are killed together or not at all.

To determine which cgroup has to be killed, we traverse the cgroup
hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
looking for the highest-level cgroup with memory.oom.group set.
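
The upward walk described above can be sketched in userspace C.  This is
only an illustration, not the kernel code: struct cg, its fields and
find_oom_group() are hypothetical names, and a NULL oom_domain is assumed
to mean "walk up to the root".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory cgroup. */
struct cg {
	struct cg *parent;	/* NULL for the root cgroup */
	bool oom_group;		/* value of memory.oom.group */
};

/*
 * Walk from the victim's cgroup up to the OOMing cgroup (inclusive) and
 * return the highest-level cgroup that has oom_group set, or NULL if no
 * cgroup on the path has it set.  With oom_domain == NULL the walk goes
 * all the way up to the root.
 */
static struct cg *find_oom_group(struct cg *victim, struct cg *oom_domain)
{
	struct cg *found = NULL;
	struct cg *c;

	assert(victim != NULL);

	for (c = victim; c != NULL; c = c->parent) {
		if (c->oom_group)
			found = c;	/* later (higher) matches overwrite */
		if (c == oom_domain)
			break;		/* never look above the OOMing cgroup */
	}
	return found;
}
```

A NULL result would correspond to the case where no group kill applies on
the path and only the selected victim task is affected.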

Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
an exception and are never killed.

This patch doesn't change the OOM victim selection algorithm.

Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
Signed-off-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: David Rientjes 
Cc: Tetsuo Handa 
Cc: Tejun Heo 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: reduce memcg tree traversals for stats collection

2018-08-22T17:52:44+00:00

Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
to collect the stats while cgroup-v2's memory_stat_show traverses the
memcg tree thrice.  On a large machine, a couple thousand memcgs is very
normal, and if the churn is high and memcgs stick around for various
reasons, tens of thousands of nodes can exist in the memcg tree.  This
patch refactors and shares the stat collection code between cgroup-v1 and
cgroup-v2 and reduces the tree traversal to just one.
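
The idea of collecting every counter in one traversal can be sketched as
follows.  This is a userspace illustration under assumptions: struct node,
NR_STATS and accumulate() are hypothetical and do not mirror the kernel's
data structures.

```c
#include <assert.h>
#include <stddef.h>

#define NR_STATS 3	/* hypothetical; the real stat files report more */

/* Hypothetical memcg node in a first-child/next-sibling tree. */
struct node {
	struct node *child;
	struct node *sibling;
	unsigned long stat[NR_STATS];	/* this node's local counters */
};

/*
 * Single pass over the subtree rooted at 'root': each visited node adds
 * all of its counters to 'total' at once, instead of the tree being
 * walked once per counter.  Returns the number of nodes visited.
 */
static unsigned long accumulate(const struct node *root,
				unsigned long total[NR_STATS])
{
	unsigned long visited = 1;
	const struct node *c;
	int i;

	assert(root != NULL);

	for (i = 0; i < NR_STATS; i++)
		total[i] += root->stat[i];

	for (c = root->child; c != NULL; c = c->sibling)
		visited += accumulate(c, total);

	return visited;
}
```

With N nodes and S counters, a per-counter walk visits N * S nodes while
this sketch visits N, which is the kind of saving the benchmark below
measures.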

I ran a simple benchmark which reads the root_mem_cgroup's stat file
1000 times in the presence of 2500 memcgs on cgroup-v1. The results are:

Without the patch:
$ time ./read-root-stat-1000-times

real    0m1.663s
user    0m0.000s
sys     0m1.660s

With the patch:
$ time ./read-root-stat-1000-times

real    0m0.468s
user    0m0.000s
sys     0m0.467s

Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
Signed-off-by: Shakeel Butt 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Greg Thelen 
Cc: Bruce Merry 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmscan.c: clear shrinker bit if there are no objects related to memcg

2018-08-17T23:20:31+00:00

To avoid further unneeded calls of do_shrink_slab() for shrinkers that
no longer have any charged objects in a memcg, their bits have to
be cleared.

This patch introduces a lockless mechanism to do that without racing
with a parallel list_lru_add().  After do_shrink_slab() returns
SHRINK_EMPTY the first time, we clear the bit and call it once again.
Then we restore the bit if the new return value is different.

Note that a single smp_mb__after_atomic() in shrink_slab_memcg() covers
two situations:

1)list_lru_add()     shrink_slab_memcg()
    list_add_tail()    for_each_set_bit() <--- read bit
                         do_shrink_slab() <--- missed list update (no barrier)
    <MB>                 <MB>
    set_bit()            do_shrink_slab() <--- seen list update

This situation, in which the first do_shrink_slab() sees the set bit but
does not see the list update (i.e., a race with the first element being
queued), is rare.  So we don't add a memory barrier before the first call
of do_shrink_slab(), to avoid slowing down the generic case.  Also, the
second call is needed anyway, as seen in (2) below.

2)list_lru_add()      shrink_slab_memcg()
    list_add_tail()     ...
    set_bit()           ...
  ...                   for_each_set_bit()
  do_shrink_slab()        do_shrink_slab()
    clear_bit()           ...
  ...                     ...
  list_lru_add()          ...
    list_add_tail()       clear_bit()
    <MB>                  <MB>
    set_bit()             do_shrink_slab()

The barriers guarantee that the second do_shrink_slab() in the right-side
task sees the list update if it really cleared the bit.  This case is
drawn in the code comment.
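
The clear/retry/restore sequence can be sketched in userspace with C11
atomics.  This is a minimal sketch under assumptions, not the kernel
implementation: shrink_one(), shrink_fn, struct script and
scripted_shrink() are hypothetical, and sequentially consistent atomic
read-modify-write operations stand in for clear_bit() followed by
smp_mb__after_atomic().

```c
#include <assert.h>
#include <stdatomic.h>

#define SHRINK_EMPTY (~0UL - 1)	/* sentinel: shrinker has no objects */

/* Hypothetical callback: returns freed objects or SHRINK_EMPTY. */
typedef unsigned long (*shrink_fn)(void *ctx);

/*
 * Call the shrinker; on SHRINK_EMPTY clear its bit and call it once
 * more.  If the second call finds objects (a racing add was missed by
 * the first call), restore the bit.
 */
static unsigned long shrink_one(atomic_ulong *map, unsigned int bit,
				shrink_fn do_shrink, void *ctx)
{
	unsigned long mask = 1UL << bit;
	unsigned long ret = do_shrink(ctx);

	if (ret != SHRINK_EMPTY)
		return ret;

	/* seq_cst RMW: clears the bit and orders the following call */
	atomic_fetch_and(map, ~mask);

	ret = do_shrink(ctx);
	if (ret == SHRINK_EMPTY)
		return 0;		/* really empty, bit stays cleared */

	atomic_fetch_or(map, mask);	/* objects appeared, restore bit */
	return ret;
}

/* Hypothetical test double: replays a scripted list of return values. */
struct script {
	const unsigned long *vals;
	unsigned int pos;
};

static unsigned long scripted_shrink(void *ctx)
{
	struct script *s = ctx;

	return s->vals[s->pos++];
}
```

The scripted callback only exists to exercise both outcomes: two empty
results leave the bit cleared, while an empty result followed by a
non-empty one leaves it set.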

[Results/performance of the patchset]

After the whole patchset is applied, the test below shows a significant
increase in performance:

  $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
  $mkdir /sys/fs/cgroup/memory/ct
  $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
  $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
        echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
        mkdir -p s/$i; mount -t tmpfs $i s/$i;
        touch s/$i/file; done

Then, 5 sequential calls of drop caches:

  $time echo 3 > /proc/sys/vm/drop_caches

1)Before:
  0.00user 13.78system 0:13.78elapsed 99%CPU
  0.00user 5.59system 0:05.60elapsed 99%CPU
  0.00user 5.48system 0:05.48elapsed 99%CPU
  0.00user 8.35system 0:08.35elapsed 99%CPU
  0.00user 8.34system 0:08.35elapsed 99%CPU

2)After:
  0.00user 1.10system 0:01.10elapsed 99%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU

The results show that, after the first call, performance increases by at
least 548 times (5.48s before versus 0.01s after).

Shakeel Butt tested this patchset with a fork-bomb on his configuration:

 > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
 > file containing few KiBs on corresponding mount. Then in a separate
 > memcg of 200 MiB limit ran a fork-bomb.
 >
 > I ran the "perf record -ag -- sleep 60" and below are the results:
 >
 > Without the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
 > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
 > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
 > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
 > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
 > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 >
 > With the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
 > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
 > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
 > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
 > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
 > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
 > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance

2018-08-17T23:20:31+00:00

Introduce the set_shrinker_bit() function to set the shrinker-related bit
in the memcg shrinker bitmap, and set the bit after the first item is added
and when a destroyed memcg's items are reparented.

This will allow the next patch to call shrinkers only when they have
charged objects at the moment, and so to improve shrink_slab()
performance.
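
A minimal single-threaded sketch of the two places where the bit gets set
is shown below.  It makes several assumptions: struct memcg_sketch, struct
lru_sketch, lru_add() and lru_reparent() are hypothetical, the bitmap is a
single 64-bit word, and the locking and atomic bit operations of the real
code are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical memcg with a one-word shrinker bitmap. */
struct memcg_sketch {
	uint64_t shrinker_map;
};

/* Hypothetical per-memcg LRU list owned by one shrinker. */
struct lru_sketch {
	unsigned int shrinker_id;
	long nr_items;
};

static void set_shrinker_bit(struct memcg_sketch *memcg, unsigned int id)
{
	assert(id < 64);
	memcg->shrinker_map |= UINT64_C(1) << id;
}

/* Adding an item: only the 0 -> 1 transition touches the bitmap. */
static void lru_add(struct lru_sketch *lru, struct memcg_sketch *memcg)
{
	if (lru->nr_items++ == 0)
		set_shrinker_bit(memcg, lru->shrinker_id);
}

/*
 * Reparenting: items of a destroyed memcg move to the destination list,
 * so the destination memcg's bit must be set if anything was moved.
 */
static void lru_reparent(struct lru_sketch *src, struct lru_sketch *dst,
			 struct memcg_sketch *dst_memcg)
{
	int moved = src->nr_items != 0;

	dst->nr_items += src->nr_items;
	src->nr_items = 0;
	if (moved)
		set_shrinker_bit(dst_memcg, dst->shrinker_id);
}
```

Setting an already set bit is harmless here, so the reparenting path does
not need to check the destination's previous item count.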

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memcontrol.c: export mem_cgroup_is_root()

2018-08-17T23:20:31+00:00

This will be used in the next patch.

Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node()

2018-08-17T23:20:31+00:00

This is just refactoring to allow the next patches to have the dst_memcg
pointer in memcg_drain_list_lru_node().

Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memcg: assign memcg-aware shrinkers bitmap to memcg

2018-08-17T23:20:30+00:00

Imagine a big node with many cpus, memory cgroups and containers.  Say
we have 200 containers, and every container has 10 mounts and 10 cgroups.
No container's tasks touch other containers' mounts.  If there is
intensive page writing and global reclaim happens, a writing task has to
iterate over all memcgs to shrink slab before it is able to go to
shrink_page_list().

Iteration over all the memcg slabs is very expensive: the task has to
visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
2000 memcgs, the total calls are 2000 * 2000 = 4000000.

So, the shrinker makes 4 million do_shrink_slab() calls just to try to
isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcgs via
shrink_page_list().  I've observed a node spending almost 100% of its time
in the kernel, uselessly iterating over already shrunk slabs.

This patch adds a bitmap of memcg-aware shrinkers to memcg.  The size of
the bitmap depends on bitmap_nr_ids, and during the memcg's life it is
maintained to be large enough to fit bitmap_nr_ids shrinkers.  Every bit
in the map corresponds to a shrinker id.

The next patches will keep a bit set only for a memcg that really has
charged objects.  This will allow shrink_slab() to improve its performance
significantly.  See the last patch for the numbers.
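
The effect of the bitmap on the number of shrinker calls can be sketched
as follows.  This is a userspace illustration only: the helper names are
hypothetical, and the bitmap is a plain array of unsigned long rather than
the kernel's per-node shrinker map.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Every shrinker is tried for every memcg: 2000 * 2000 in the example. */
static unsigned long calls_without_bitmap(unsigned long nr_memcgs,
					  unsigned long nr_shrinkers)
{
	return nr_memcgs * nr_shrinkers;
}

/*
 * With a per-memcg bitmap only shrinkers whose bit is set are called;
 * this counts how many calls one memcg with the given map would need.
 */
static unsigned long calls_with_bitmap(const unsigned long *map,
				       size_t nr_ids)
{
	unsigned long calls = 0;
	size_t id;

	for (id = 0; id < nr_ids; id++)
		if (map[id / BITS_PER_WORD] & (1UL << (id % BITS_PER_WORD)))
			calls++;

	return calls;
}
```

In the container example above, each memcg would have at most a handful
of bits set, so the per-memcg call count drops from 2000 to the number of
shrinkers that actually hold its objects.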

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
[ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
  Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memcontrol.c: move up for_each_mem_cgroup{, _tree} defines

2018-08-17T23:20:30+00:00

The next patch requires these defines to be above their current position,
so they are moved up to the declarations.

Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB

2018-08-17T23:20:30+00:00

Introduce a new config option, which is used to replace the repeating
CONFIG_MEMCG && !CONFIG_SLOB pattern.  The next patches add a little more
memcg+kmem related code, so let's keep the defines clearer.
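
The replacement of the repeated pattern can be illustrated with a small
userspace sketch.  Note the assumptions: in the kernel the new symbol is
derived by Kconfig, whereas here the preprocessor stands in for that
dependency, and memcg_kmem_compiled_in() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed build configuration for this sketch. */
#define CONFIG_MEMCG 1
/* CONFIG_SLOB is intentionally left undefined. */

/* Userspace stand-in for the dependency MEMCG && !SLOB. */
#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
#define CONFIG_MEMCG_KMEM 1
#endif

/* Call sites test one symbol instead of repeating the combination. */
static bool memcg_kmem_compiled_in(void)
{
#ifdef CONFIG_MEMCG_KMEM
	return true;
#else
	return false;
#endif
}
```

Each former "#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)" site then
shrinks to a single "#ifdef CONFIG_MEMCG_KMEM" check.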

Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai 
Acked-by: Vladimir Davydov 
Tested-by: Shakeel Butt 
Cc: Al Viro 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Greg Kroah-Hartman 
Cc: Guenter Roeck 
Cc: "Huang, Ying" 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Li RongQing 
Cc: Matthew Wilcox 
Cc: Matthias Kaehlcke 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Philippe Ombredanne 
Cc: Roman Gushchin 
Cc: Sahitya Tummala 
Cc: Stephen Rothwell 
Cc: Tetsuo Handa 
Cc: Thomas Gleixner 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds