linux.git/mm/memcontrol.c, branch v4.4

mm: memcontrol: fix possible memcg leak due to interrupted reclaim

2015-12-30T01:45:49+00:00

Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
once enough pages have been reclaimed, in which case, in contrast to a
full round-trip over a cgroup sub-tree, the current position stored in
mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
and so is left holding the reference to the last scanned cgroup.  If the
target cgroup does not get scanned again (we might have just reclaimed
the last page or all processes might exit and free their memory
voluntarily), we will leak it, because there is nobody to put the
reference held by the iterator.

The problem is easy to reproduce by running the following command
sequence in a loop:

    mkdir /sys/fs/cgroup/memory/test
    echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
    memhog 150M
    echo $$ > /sys/fs/cgroup/memory/cgroup.procs
    rmdir test

The cgroups generated by it will never get freed.

This patch fixes this issue by making mem_cgroup_iter avoid taking a
reference to the current position.  In order not to hit a use-after-free
bug while running reclaim in parallel with cgroup deletion, we make use
of the ->css_released cgroup callback to clear references to the dying
cgroup in all reclaim iterators that might refer to it.  This callback
is called right before scheduling the rcu work which will free the css,
so if we access iter->position from an rcu read section, we can be sure
it won't go away under us.
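
The idea of a position that holds no reference and is cleared from a
release callback can be sketched in userspace as follows.  This is only
a rough model under stated assumptions: group_model, reclaim_iter_model
and all function names are hypothetical, a single iterator is passed in
explicitly instead of walking every per-zone iterator, and RCU is not
modeled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup; only a refcount is modeled. */
struct group_model {
    int refs;
};

/* Hypothetical stand-in for mem_cgroup_reclaim_iter. */
struct reclaim_iter_model {
    _Atomic(struct group_model *) position;  /* weak: pins nothing */
};

static void iter_init(struct reclaim_iter_model *iter)
{
    atomic_init(&iter->position, (struct group_model *)NULL);
}

/* Remember the last scanned group without taking a reference. */
static void iter_set_position(struct reclaim_iter_model *iter,
                              struct group_model *g)
{
    atomic_store(&iter->position, g);
}

static struct group_model *iter_position(struct reclaim_iter_model *iter)
{
    return atomic_load(&iter->position);
}

/* Models ->css_released: forget the dying group if the iterator has it. */
static void group_released(struct reclaim_iter_model *iter,
                           struct group_model *dead)
{
    struct group_model *expected = dead;

    atomic_compare_exchange_strong(&iter->position, &expected,
                                   (struct group_model *)NULL);
}

/* Drop a reference; the release hook runs before the memory would go. */
static void group_put(struct reclaim_iter_model *iter,
                      struct group_model *g)
{
    assert(g->refs > 0);
    if (--g->refs == 0)
        group_released(iter, g);
}
```

Because the iterator never bumps refs, an interrupted walk cannot keep
the group alive, and the compare-and-exchange only clears the position
when it still points at the dying group.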

[hannes@cmpxchg.org: clean up css ref handling]
Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Vladimir Davydov 
Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: 	[3.19+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix kerneldoc on mem_cgroup_replace_page

2015-12-12T18:15:34+00:00

Whoops, I missed removing the kerneldoc comment of the lrucare arg
removed from mem_cgroup_replace_page; but it's a good comment, keep it.

Signed-off-by: Hugh Dickins 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: fix memory.high target

2015-12-12T18:15:34+00:00

When the memory.high threshold is exceeded, try_charge() schedules a
task_work to reclaim the excess.  The reclaim target is set to the
number of pages requested by try_charge().

This is wrong, because try_charge() usually charges more pages than
requested (batch > nr_pages) in order to refill per cpu stocks.  As a
result, a process in a cgroup can easily exceed memory.high
significantly when doing a lot of charges without returning to userspace
(e.g.  reading a file in big chunks).

Fix this issue by ensuring that when exceeding memory.high a process
reclaims as many pages as were actually charged (i.e.  batch).
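
The difference between the two reclaim targets can be illustrated with
a small userspace model.  The names (task_model, charge_batch,
DEMO_CHARGE_BATCH) are hypothetical, the batch size of 32 pages is an
assumption made for the example, and the model ignores the case where
a batched charge fails and is retried with the requested size.

```c
#include <assert.h>

/* Assumed per-cpu stock refill size, in pages (illustrative value). */
#define DEMO_CHARGE_BATCH 32UL

/* Hypothetical per-task state: pages to reclaim on return to userspace. */
struct task_model {
    unsigned long nr_pages_over_high;
};

/* Pages actually charged when the per-cpu stock has to be refilled. */
static unsigned long charge_batch(unsigned long nr_pages)
{
    return nr_pages > DEMO_CHARGE_BATCH ? nr_pages : DEMO_CHARGE_BATCH;
}

/* Before the fix: the target grows by what the caller asked for. */
static void note_over_high_old(struct task_model *t, unsigned long nr_pages)
{
    assert(nr_pages > 0);
    t->nr_pages_over_high += nr_pages;
}

/* After the fix: the target grows by what was really charged. */
static void note_over_high_new(struct task_model *t, unsigned long nr_pages)
{
    assert(nr_pages > 0);
    t->nr_pages_over_high += charge_batch(nr_pages);
}
```

In this model a single-page request that refills the stock adds 1 page
to the old target but 32 pages to the new one, which is the gap that
let usage drift above memory.high.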

Signed-off-by: Vladimir Davydov 
Acked-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

cgroup: fix handling of multi-destination migration from subtree_control enabling

2015-12-03T15:18:21+00:00

Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes to the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  The pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses, often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [] dump_stack+0x4e/0x82
  [] warn_slowpath_common+0x82/0xc0
  [] warn_slowpath_null+0x1a/0x20
  [] pids_cancel.constprop.6+0x31/0x40
  [] pids_can_attach+0x6d/0xf0
  [] cgroup_taskset_migrate+0x6c/0x330
  [] cgroup_migrate+0xf5/0x190
  [] cgroup_attach_task+0x176/0x200
  [] __cgroup_procs_write+0x2ad/0x460
  [] cgroup_procs_write+0x14/0x20
  [] cgroup_file_write+0x35/0x1c0
  [] kernfs_fop_write+0x141/0x190
  [] __vfs_write+0x28/0xe0
  [] vfs_write+0xac/0x1a0
  [] SyS_write+0x49/0xb0
  [] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing the @css parameter from the three
migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
updating the cgroup_taskset iteration helpers to also return the
destination css in addition to the task being migrated.  All controllers
are updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's a single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accommodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.
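
The shape of an iteration helper that hands back the destination css
together with each task can be sketched as below.  This is a userspace
model, not the kernel interface: css_model, task_model, taskset_model,
taskset_first(), taskset_next() and pids_like_attach() are hypothetical
names, and the per-css counter stands in for the pids charge.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical css with a single charge counter, as pids keeps. */
struct css_model {
    long counter;
};

/* Hypothetical task that records where it comes from and goes to. */
struct task_model {
    struct css_model *src;
    struct css_model *dst;
};

/* Hypothetical taskset: an array of migrating tasks plus a cursor. */
struct taskset_model {
    struct task_model *tasks;
    size_t nr;
    size_t cur;
};

/* Return the next task and report its own destination css. */
static struct task_model *taskset_next(struct taskset_model *tset,
                                       struct css_model **dst_cssp)
{
    struct task_model *t;

    if (tset->cur >= tset->nr)
        return NULL;
    t = &tset->tasks[tset->cur++];
    if (dst_cssp)
        *dst_cssp = t->dst;
    return t;
}

static struct task_model *taskset_first(struct taskset_model *tset,
                                        struct css_model **dst_cssp)
{
    tset->cur = 0;
    return taskset_next(tset, dst_cssp);
}

/* Move one charge per task from its source to its own destination. */
static void pids_like_attach(struct taskset_model *tset)
{
    struct css_model *dst;
    struct task_model *t;

    for (t = taskset_first(tset, &dst); t; t = taskset_next(tset, &dst)) {
        dst->counter++;
        t->src->counter--;
        assert(t->src->counter >= 0);  /* the underflow the bug caused */
    }
}
```

With two tasks leaving P1 for A and B respectively, each destination
ends up with exactly one charge, since no single target css is assumed
for the whole set.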

Signed-off-by: Tejun Heo 
Reported-and-tested-by: Daniel Wagner 
Cc: Aleksa Sarai

mm/memcontrol.c: uninline mem_cgroup_usage

2015-11-07T01:50:42+00:00

gcc version 5.2.1 20151010 (Debian 5.2.1-22)
$ size mm/memcontrol.o mm/memcontrol.o.before
   text    data     bss     dec     hex filename
  35535    7908      64   43507    a9f3 mm/memcontrol.o
  35762    7908      64   43734    aad6 mm/memcontrol.o.before

Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM

2015-11-07T01:50:42+00:00

__GFP_WAIT was used to signal that the caller was in atomic context and
could not sleep.  Now it is possible to distinguish between true atomic
context and callers that are not willing to sleep.  The latter should
clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing
it prevents it.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Acked-by: Johannes Weiner 
Cc: Christoph Lameter 
Acked-by: David Rientjes 
Cc: Vitaly Wool 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

2015-11-07T01:50:42+00:00

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access to one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimistic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
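
How the split flags compose, and why testing the old two-bit mask gives
false positives, can be sketched in userspace.  The DEMO_GFP_* names
mirror the kernel flags of the same name without the prefix, the bit
values are placeholders chosen for the example, and demo_allow_blocking()
is a stand-in for gfpflags_allow_blocking().

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_gfp_t;

/* Placeholder bit values; only their distinctness matters here. */
#define DEMO_GFP_HIGH            0x01u
#define DEMO_GFP_ATOMIC_BIT      0x02u
#define DEMO_GFP_DIRECT_RECLAIM  0x04u
#define DEMO_GFP_KSWAPD_RECLAIM  0x08u
#define DEMO_GFP_IO              0x10u
#define DEMO_GFP_FS              0x20u

/* The redefined __GFP_WAIT, later renamed __GFP_RECLAIM. */
#define DEMO_GFP_RECLAIM \
    (DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM)

#define DEMO_GFP_ATOMIC \
    (DEMO_GFP_HIGH | DEMO_GFP_ATOMIC_BIT | DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_KERNEL  (DEMO_GFP_RECLAIM | DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_GFP_NOWAIT  (DEMO_GFP_KSWAPD_RECLAIM)

/* May the caller sleep in direct reclaim? */
static bool demo_allow_blocking(demo_gfp_t gfp)
{
    return (gfp & DEMO_GFP_DIRECT_RECLAIM) != 0;
}

/* Will the allocation wake kswapd for background reclaim? */
static bool demo_wakes_kswapd(demo_gfp_t gfp)
{
    return (gfp & DEMO_GFP_KSWAPD_RECLAIM) != 0;
}

/* The historical check: any bit of the two-bit mask counts as "wait". */
static bool demo_legacy_wait_check(demo_gfp_t gfp)
{
    return (gfp & DEMO_GFP_RECLAIM) != 0;
}

/* Mempool-backed callers: never block, but still wake kswapd. */
static demo_gfp_t demo_mempool_gfp(demo_gfp_t gfp)
{
    gfp &= ~DEMO_GFP_DIRECT_RECLAIM;
    assert(!demo_allow_blocking(gfp));
    return gfp;
}
```

In this model demo_legacy_wait_check(DEMO_GFP_NOWAIT) is true even
though the caller cannot block, which is the false positive the helper
avoids.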

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Vitaly Wool 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'akpm' (patches from Andrew)

2015-11-06T07:10:54+00:00

Merge patch-bomb from Andrew Morton:

 - inotify tweaks

 - some ocfs2 updates (many more are awaiting review)

 - various misc bits

 - kernel/watchdog.c updates

 - Some of mm.  I have a huge number of MM patches this time and quite a
   lot of it is quite difficult and much will be held over to next time.

* emailed patches from Andrew Morton: (162 commits)
  selftests: vm: add tests for lock on fault
  mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
  mm: introduce VM_LOCKONFAULT
  mm: mlock: add new mlock system call
  mm: mlock: refactor mlock, munlock, and munlockall code
  kasan: always taint kernel on report
  mm, slub, kasan: enable user tracking by default with KASAN=y
  kasan: use IS_ALIGNED in memory_is_poisoned_8()
  kasan: Fix a type conversion error
  lib: test_kasan: add some testcases
  kasan: update reference to kasan prototype repo
  kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
  kasan: various fixes in documentation
  kasan: update log messages
  kasan: accurately determine the type of the bad access
  kasan: update reported bug types for kernel memory accesses
  kasan: update reported bug types for not user nor kernel memory accesses
  mm/kasan: prevent deadlock in kasan reporting
  mm/kasan: don't use kasan shadow pointer in generic functions
  mm/kasan: MODULE_VADDR is not available on all archs
  ...

memcg: fix thresholds for 32b architectures.

2015-11-06T03:34:48+00:00

Commit 424cdc141380 ("memcg: convert threshold to bytes") has fixed a
regression introduced by 3e32cb2e0a12 ("mm: memcontrol: lockless page
counters") where thresholds were silently converted to use page units
rather than bytes when interpreting the user input.

The fix is not complete, though, as properly pointed out by Ben Hutchings
during stable backport review.  The page count is converted to bytes but
unsigned long is used to hold the value which would be obviously not
sufficient for 32b systems with more than 4G thresholds.  The same applies
to usage as taken from mem_cgroup_usage which might overflow.

Let's remove this bytes vs.  pages internal tracking difference and
handle thresholds in page units internally.  Change mem_cgroup_usage() to
return the value in page units and revert 424cdc141380 because this should
be sufficient for the consistent handling.  mem_cgroup_read_u64, as the
only user of mem_cgroup_usage outside of the threshold handling code, is
converted to give the proper result in bytes.  It is doing that already
for page_counter output so this is more consistent as well.

The value presented to userspace is still in bytes.
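
The overflow can be reproduced in userspace by using uint32_t as a
stand-in for unsigned long on a 32b system.  Everything below is an
illustrative model: the function names are hypothetical and a 4K page
size (shift of 12) is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12  /* assumed 4K pages */

/* Byte value as it would be held in a 32-bit unsigned long: wraps. */
static uint32_t bytes_in_ulong32(uint32_t nr_pages)
{
    return (uint32_t)(nr_pages << DEMO_PAGE_SHIFT);
}

/* Comparison in bytes with 32-bit storage on both sides. */
static bool crossed_bytes32(uint32_t usage_pages, uint64_t threshold_bytes)
{
    uint32_t thr = (uint32_t)threshold_bytes;  /* truncated on store */

    return bytes_in_ulong32(usage_pages) >= thr;
}

/* Comparison in page units: a page count fits easily in 32 bits. */
static bool crossed_pages(uint32_t usage_pages, uint64_t threshold_bytes)
{
    uint64_t thr_pages = threshold_bytes >> DEMO_PAGE_SHIFT;

    assert(thr_pages <= UINT32_MAX);
    return usage_pages >= (uint32_t)thr_pages;
}

/* Userspace still sees bytes, widened to 64 bits before the shift. */
static uint64_t usage_for_userspace(uint32_t usage_pages)
{
    return (uint64_t)usage_pages << DEMO_PAGE_SHIFT;
}
```

With a 5G threshold and 2G of usage, the byte comparison wraps the
threshold to 1G and reports it as crossed, while the page comparison
correctly reports that it is not.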

Fixes: 424cdc141380 ("memcg: convert threshold to bytes")
Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
Signed-off-by: Michal Hocko 
Reported-by: Ben Hutchings 
Reviewed-by: Vladimir Davydov 
Acked-by: Johannes Weiner 
Cc: 
From: Michal Hocko 
Subject: memcg-fix-thresholds-for-32b-architectures-fix

Cc: Ben Hutchings 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
From: Andrew Morton 
Subject: memcg-fix-thresholds-for-32b-architectures-fix-fix

don't attempt to inline mem_cgroup_usage()

The compiler ignores the inline anyway.  And __always_inlining it adds 600
bytes of goop to the .o file.

Cc: Ben Hutchings 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page_counter: let page_counter_try_charge() return bool

2015-11-06T03:34:48+00:00

page_counter_try_charge() currently returns 0 on success and -ENOMEM on
failure, which is surprising behavior given the function name.

Make it follow the expected pattern of try_stuff() functions that return a
boolean true to indicate success, or false for failure.
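
The calling convention can be illustrated with a simplified userspace
counter.  This is a sketch under stated assumptions: counter_model and
counter_try_charge() are hypothetical names, and the model checks the
limit before adding and uses plain integers, so it does not reflect the
lockless behaviour of the real page counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hierarchical counter, in pages. */
struct counter_model {
    unsigned long count;
    unsigned long limit;
    struct counter_model *parent;
};

/*
 * Charge nr_pages to counter and all of its ancestors.  Returns true on
 * success.  On failure nothing stays charged, *fail points at the level
 * that hit its limit, and false is returned.
 */
static bool counter_try_charge(struct counter_model *counter,
                               unsigned long nr_pages,
                               struct counter_model **fail)
{
    struct counter_model *c, *undo;

    for (c = counter; c; c = c->parent) {
        if (c->count + nr_pages > c->limit) {
            *fail = c;
            goto rollback;
        }
        c->count += nr_pages;
    }
    return true;

rollback:
    /* Undo the levels charged before the one that failed. */
    for (undo = counter; undo != c; undo = undo->parent) {
        assert(undo->count >= nr_pages);
        undo->count -= nr_pages;
    }
    return false;
}
```

A caller can then write "if (!counter_try_charge(...))" for the failure
path instead of comparing the result against an errno value.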

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Vladimir Davydov 
Signed-off-by: Linus Torvalds