linux-stable.git/mm/rmap.c, branch linux-3.4.y

mm: fix anon_vma->degree underflow in anon_vma endless growing prevention

2015-04-14T09:34:04+00:00

commit 3fe89b3e2a7bbf3e97657104b9b33a9d81b950b3 upstream.

I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
upgrading to 3.19 and had no luck with 4.0-rc1 neither.

So, after looking into new logic introduced by commit 7a3ef208e662 ("mm:
prevent endless growth of anon_vma hierarchy"), I found chances are that
unlink_anon_vmas() is called without incrementing dst->anon_vma->degree
in anon_vma_clone() due to allocation failure.  If dst->anon_vma is not
NULL in error path, its degree will be incorrectly decremented in
unlink_anon_vmas() and eventually underflow when exiting as a result of
another call to unlink_anon_vmas().  That's how "kernel BUG at
mm/rmap.c:399!" is triggered for me.

This patch fixes the underflow by dropping dst->anon_vma when allocation
fails.  It's safe to do so regardless of original value of dst->anon_vma
because dst->anon_vma doesn't have valid meaning if anon_vma_clone()
fails.  Besides, callers don't care dst->anon_vma in such case neither.

Also suggested by Michal Hocko, we can clean up vma_adjust() a bit as
anon_vma_clone() now does the work.

[akpm@linux-foundation.org: tweak comment]
Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy")
Signed-off-by: Leon Yu 
Signed-off-by: Konstantin Khlebnikov 
Reviewed-by: Michal Hocko 
Acked-by: Rik van Riel 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Zefan Li

mm: prevent endless growth of anon_vma hierarchy

2015-04-14T09:33:51+00:00

commit 7a3ef208e662f4b63d43a23f61a64a129c525bbc upstream.

Constantly forking task causes unlimited grow of anon_vma chain.  Each
next child allocates new level of anon_vmas and links vma to all
previous levels because pages might be inherited from any level.

This patch adds heuristic which decides to reuse existing anon_vma
instead of forking new one.  It adds counter anon_vma->degree which
counts linked vmas and directly descending anon_vmas and reuses anon_vma
if counter is lower than two.  As a result each anon_vma has either vma
or at least two descending anon_vmas.  In such trees half of nodes are
leafs with alive vmas, thus count of anon_vmas is no more than two times
bigger than count of vmas.

This heuristic reuses anon_vmas as few as possible because each reuse
adds false aliasing among vmas and rmap walker ought to scan more ptes
when it searches where page is might be mapped.

Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue")
[akpm@linux-foundation.org: fix typo, per Rik]
Signed-off-by: Konstantin Khlebnikov 
Reported-by: Daniel Forrest 
Tested-by: Michal Hocko 
Tested-by: Jerome Marchand 
Reviewed-by: Michal Hocko 
Reviewed-by: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li

mm: try_to_unmap_cluster() should lock_page() before mlocking

2014-08-07T19:00:11+00:00

commit 57e68e9cd65b4b8eb4045a1e0d0746458502554c upstream.

A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
fuzzing with trinity.  The call site try_to_unmap_cluster() does not lock
the pages other than its check_page parameter (which is already locked).

The BUG_ON in mlock_vma_page() is not documented and its purpose is
somewhat unclear, but apparently it serializes against page migration,
which could otherwise fail to transfer the PG_mlocked flag.  This would
not be fatal, as the page would be eventually encountered again, but
NR_MLOCK accounting would become distorted nevertheless.  This patch adds
a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
effect.

The call site try_to_unmap_cluster() is fixed so that for page !=
check_page, trylock_page() is attempted (to avoid possible deadlocks as we
already have check_page locked) and mlock_vma_page() is performed only
upon success.  If the page lock cannot be obtained, the page is left
without PG_mlocked, which is again not a problem in the whole unevictable
memory design.

Signed-off-by: Vlastimil Babka 
Signed-off-by: Bob Liu 
Reported-by: Sasha Levin 
Cc: Wanpeng Li 
Cc: Michel Lespinasse 
Cc: KOSAKI Motohiro 
Acked-by: Rik van Riel 
Cc: David Rientjes 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings 
Cc: Yijing Wang 
Signed-off-by: Greg Kroah-Hartman

mm: fix sleeping function warning from __put_anon_vma

2014-07-01T03:01:31+00:00

commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream.

Trinity reports BUG:

  sleeping function called from invalid context at kernel/locking/rwsem.c:47
  in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27

__might_sleep < down_write < __put_anon_vma < page_get_anon_vma <
migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages ..

Right, since conversion to mutex then rwsem, we should not put_anon_vma()
from inside an rcu_read_lock()ed section: fix the two places that did so.
And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra.

Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
Reported-by: Dave Jones 
Signed-off-by: Hugh Dickins 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: rmap: fix use-after-free in __put_anon_vma

2014-06-11T19:04:13+00:00

commit 624483f3ea82598ab0f62f1bdb9177f531ab1892 upstream.

While working address sanitizer for kernel I've discovered
use-after-free bug in __put_anon_vma.

For the last anon_vma, anon_vma->root freed before child anon_vma.
Later in anon_vma_free(anon_vma) we are referencing to already freed
anon_vma->root to check rwsem.

This fixes it by freeing the child anon_vma before freeing
anon_vma->root.

Signed-off-by: Andrey Ryabinin 
Acked-by: Peter Zijlstra 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/hugetlb: check for pte NULL pointer in __page_check_address()

2014-01-08T17:42:12+00:00

commit 98398c32f6687ee1e1f3ae084effb4b75adb0747 upstream.

In __page_check_address(), if address's pud is not present,
huge_pte_offset() will return NULL, we should check the return value.

Signed-off-by: Jianguo Wu 
Cc: Naoya Horiguchi 
Cc: Mel Gorman 
Cc: qiuxishi 
Cc: Hanjun Guo 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: fix XFS oops due to dirty pages without buffers on s390

2012-10-31T17:02:56+00:00

commit ef5d437f71afdf4afdbab99213add99f4b1318fd upstream.

On s390 any write to a page (even from kernel itself) sets architecture
specific page dirty bit.  Thus when a page is written to via buffered
write, HW dirty bit gets set and when we later map and unmap the page,
page_remove_rmap() finds the dirty bit and calls set_page_dirty().

Dirtying of a page which shouldn't be dirty can cause all sorts of
problems to filesystems.  The bug we observed in practice is that
buffers from the page get freed, so when the page gets later marked as
dirty and writeback writes it, XFS crashes due to an assertion
BUG_ON(!PagePrivate(page)) in page_buffers() called from
xfs_count_page_state().

Similar problem can also happen when zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) set the
hardware dirty bit during writeback, later buffers get freed, and then
page unmapped.

Fix the issue by ignoring s390 HW dirty bit for page cache pages of
mappings with mapping_cap_account_dirty().  This is safe because for
such mappings when a page gets marked as writeable in PTE it is also
marked dirty in do_wp_page() or do_page_fault().  When the dirty bit is
cleared by clear_page_dirty_for_io(), the page gets writeprotected in
page_mkclean().  So pagecache page is writeable if and only if it is
dirty.

Thanks to Hugh Dickins for pointing out mapping has to have
mapping_cap_account_dirty() for things to work and proposing a cleaned
up variant of the patch.

The patch has survived about two hours of running fsx-linux on tmpfs
while heavily swapping and several days of running on out build machines
where the original problem was triggered.

Signed-off-by: Jan Kara 
Cc: Martin Schwidefsky 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Heiko Carstens 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

memcg: use new logic for page stat accounting

2012-03-22T00:55:01+00:00

Now, page-stat-per-memcg is recorded into per page_cgroup flag by
duplicating page's status into the flag.  The reason is that memcg has a
feature to move a page from a group to another group and we have race
between "move" and "page stat accounting",

Under current logic, assume CPU-A and CPU-B.  CPU-A does "move" and CPU-B
does "page stat accounting".

When CPU-A goes 1st,

            CPU-A                           CPU-B
                                    update "struct page" info.
    move_lock_mem_cgroup(memcg)
    see pc->flags
    copy page stat to new group
    overwrite pc->mem_cgroup.
    move_unlock_mem_cgroup(memcg)
                                    move_lock_mem_cgroup(mem)
                                    set pc->flags
                                    update page stat accounting
                                    move_unlock_mem_cgroup(mem)

stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
(CPU-A) doesn't see changes in "struct page" information.

But it's costly to have the same information both in 'struct page' and
'struct page_cgroup'.  And, there is a potential problem.

For example, assume we have PG_dirty accounting in memcg.
PG_..is a flag for struct page.
PCG_ is a flag for struct page_cgroup.
(This is just an example. The same problem can be found in any
 kind of page stat accounting.)

	  CPU-A                               CPU-B
      TestSet PG_dirty
      (delay)                        TestClear PG_dirty
                                     if (TestClear(PCG_dirty))
                                          memcg->nr_dirty--
      if (TestSet(PCG_dirty))
          memcg->nr_dirty++

Here, memcg->nr_dirty = +1, this is wrong.  This race was reported by Greg
Thelen .  Now, only FILE_MAPPED is supported but
fortunately, it's serialized by page table lock and this is not real bug,
_now_,

If this potential problem is caused by having duplicated information in
struct page and struct page_cgroup, we may be able to fix this by using
original 'struct page' information.  But we'll have a problem in "move
account"

Assume we use only PG_dirty.

         CPU-A                   CPU-B
    TestSet PG_dirty
    (delay)                    move_lock_mem_cgroup()
                               if (PageDirty(page))
                                      new_memcg->nr_dirty++
                               pc->mem_cgroup = new_memcg;
                               move_unlock_mem_cgroup()
    move_lock_mem_cgroup()
    memcg = pc->mem_cgroup
    new_memcg->nr_dirty++

accounting information may be double-counted.  This was original reason to
have PCG_xxx flags but it seems PCG_xxx has another problem.

I think we need a bigger lock as

     move_lock_mem_cgroup(page)
     TestSetPageDirty(page)
     update page stats (without any checks)
     move_unlock_mem_cgroup(page)

This fixes both of problems and we don't have to duplicate page flag into
page_cgroup.  Please note: move_lock_mem_cgroup() is held only when there
are possibility of "account move" under the system.  So, in most path,
status update will go without atomic locks.

This patch introduces mem_cgroup_begin_update_page_stat() and
mem_cgroup_end_update_page_stat() both should be called at modifying
'struct page' information if memcg takes care of it.  as

     mem_cgroup_begin_update_page_stat()
     modify page information
     mem_cgroup_update_page_stat()
     => never check any 'struct page' info, just update counters.
     mem_cgroup_end_update_page_stat().

This patch is slow because we need to call begin_update_page_stat()/
end_update_page_stat() regardless of accounted will be changed or not.  A
following patch adds an easy optimization and reduces the cost.

[akpm@linux-foundation.org: s/lock/locked/]
[hughd@google.com: fix deadlock by avoiding stat lock when anon]
Signed-off-by: KAMEZAWA Hiroyuki 
Cc: Greg Thelen 
Acked-by: Johannes Weiner 
Cc: Michal Hocko 
Cc: KOSAKI Motohiro 
Cc: Ying Han 
Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

rmap: anon_vma_prepare: Reduce code duplication by calling anon_vma_chain_link

2012-03-22T00:54:57+00:00

Reduce code duplication by calling anon_vma_chain_link() from
anon_vma_prepare().

Also move anon_vmal_chain_link() to a more suitable location in the file.

Signed-off-by: Kautuk Consul 
Acked-by: Peter Zijlstra 
Cc: Hugh Dickins 
Reviewed-by: KAMEZWA Hiroyuki 
Cc: Mel Gorman 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: replace PAGE_MIGRATION with IS_ENABLED(CONFIG_MIGRATION)

2012-03-22T00:54:57+00:00

Since commit 2a11c8ea20bf ("kconfig: Introduce IS_ENABLED(),
IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method
for checking config options in C expressions.

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds