linux-stable.git/mm, branch v3.10.73

mm/memory.c: actually remap enough memory

2015-03-18T12:22:28+00:00

commit 9cb12d7b4ccaa976f97ce0c5fd0f1b6a83bc2a75 upstream.

For whatever reason, generic_access_phys() only remaps one page, but
actually allows to access arbitrary size.  It's quite easy to trigger
large reads, like printing out large structure with gdb, which leads to a
crash.  Fix it by remapping correct size.

Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
Signed-off-by: Grazvydas Ignotas 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/compaction: fix wrong order check in compact_finished()

2015-03-18T12:22:28+00:00

commit 372549c2a3778fd3df445819811c944ad54609ca upstream.

What we want to check here is whether there is highorder freepage in buddy
list of other migratetype in order to steal it without fragmentation.
But, current code just checks cc->order which means allocation request
order.  So, this is wrong.

Without this fix, non-movable synchronous compaction below pageblock order
would not stopped until compaction is complete, because migratetype of
most pageblocks are movable and high order freepage made by compaction is
usually on movable type buddy list.

There is some report related to this bug. See below link.

  http://www.spinics.net/lists/linux-mm/msg81666.html

Although the issued system still has load spike comes from compaction,
this makes that system completely stable and responsive according to his
report.

stress-highalloc test in mmtests with non movable order 7 allocation
doesn't show any notable difference in allocation success rate, but, it
shows more compaction success rate.

Compaction success rate (Compaction success * 100 / Compaction stalls, %)
18.47 : 28.94

Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available")
Signed-off-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Reviewed-by: Zhang Yanfei 
Cc: Mel Gorman 
Cc: David Rientjes 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()

2015-03-18T12:22:28+00:00

commit 8138a67a5557ffea3a21dfd6f037842d4e748513 upstream.

I noticed that "allowed" can easily overflow by falling below 0, because
(total_vm / 32) can be larger than "allowed".  The problem occurs in
OVERCOMMIT_NONE mode.

In this case, a huge allocation can success and overcommit the system
(despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
(system-wide), so system become unusable.

The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory

It also can be reproduced on newer kernels, but miss-configured
sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

Signed-off-by: Roman Gushchin 
Cc: Andrew Shewmaker 
Cc: Rik van Riel 
Cc: Konstantin Khlebnikov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()

2015-03-18T12:22:27+00:00

commit 5703b087dc8eaf47bfb399d6cf512d471beff405 upstream.

I noticed, that "allowed" can easily overflow by falling below 0,
because (total_vm / 32) can be larger than "allowed".  The problem
occurs in OVERCOMMIT_NONE mode.

In this case, a huge allocation can success and overcommit the system
(despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
(system-wide), so system become unusable.

The problem was masked out by commit c9b1d0981fcc
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory

It also can be reproduced on newer kernels, but miss-configured
sysctl_user_reserve_kbytes is required.

Fix this issue by switching to signed arithmetic here.

[akpm@linux-foundation.org: use min_t]
Signed-off-by: Roman Gushchin 
Cc: Andrew Shewmaker 
Cc: Rik van Riel 
Cc: Konstantin Khlebnikov 
Reviewed-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/hugetlb: add migration entry check in __unmap_hugepage_range

2015-03-18T12:22:27+00:00

commit 9fbc1f635fd0bd28cb32550211bf095753ac637a upstream.

If __unmap_hugepage_range() tries to unmap the address range over which
hugepage migration is on the way, we get the wrong page because pte_page()
doesn't work for migration entries.  This patch simply clears the pte for
migration entries as we do for hwpoison entries.

Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi 
Cc: Hugh Dickins 
Cc: James Hogan 
Cc: David Rientjes 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Andrea Arcangeli 
Cc: Luiz Capitulino 
Cc: Nishanth Aravamudan 
Cc: Lee Schermerhorn 
Cc: Steve Capper 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range

2015-02-11T06:48:16+00:00

commit 23aaed6659df9adfabe9c583e67a36b54e21df46 upstream.

walk_page_range() silently skips vma having VM_PFNMAP set, which leads
to undesirable behaviour at client end (who called walk_page_range).
Userspace applications get the wrong data, so the effect is like just
confusing users (if the applications just display the data) or sometimes
killing the processes (if the applications do something with
misunderstanding virtual addresses due to the wrong data.)

For example for pagemap_read, when no callbacks are called against
VM_PFNMAP vma, pagemap_read may prepare pagemap data for next virtual
address range at wrong index.

Eventually userspace may get wrong pagemap data for a task.
Corresponding to a VM_PFNMAP marked vma region, kernel may report
mappings from subsequent vma regions.  User space in turn may account
more pages (than really are) to the task.

In my case I was using procmem, procrack (Android utility) which uses
pagemap interface to account RSS pages of a task.  Due to this bug it
was giving a wrong picture for vmas (with VM_PFNMAP set).

Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas")
Signed-off-by: Shiraz Hashim 
Acked-by: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: Don't count the stack guard page towards RLIMIT_STACK

2015-01-16T14:59:03+00:00

commit 690eac53daff34169a4d74fc7bfbd388c4896abb upstream.

Commit fee7e49d4514 ("mm: propagate error from stack expansion even for
guard page") made sure that we return the error properly for stack
growth conditions.  It also theorized that counting the guard page
towards the stack limit might break something, but also said "Let's see
if anybody notices".

Somebody did notice.  Apparently android-x86 sets the stack limit very
close to the limit indeed, and including the guard page in the rlimit
check causes the android 'zygote' process problems.

So this adds the (fairly trivial) code to make the stack rlimit check be
against the actual real stack size, rather than the size of the vma that
includes the guard page.

Reported-and-tested-by: Chih-Wei Huang 
Cc: Jay Foad 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: propagate error from stack expansion even for guard page

2015-01-16T14:59:03+00:00

commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream.

Jay Foad reports that the address sanitizer test (asan) sometimes gets
confused by a stack pointer that ends up being outside the stack vma
that is reported by /proc/maps.

This happens due to an interaction between RLIMIT_STACK and the guard
page: when we do the guard page check, we ignore the potential error
from the stack expansion, which effectively results in a missing guard
page, since the expected stack expansion won't have been done.

And since /proc/maps explicitly ignores the guard page (commit
d7824370e263: "mm: fix up some user-visible effects of the stack guard
page"), the stack pointer ends up being outside the reported stack area.

This is the minimal patch: it just propagates the error.  It also
effectively makes the guard page part of the stack limit, which in turn
measn that the actual real stack is one page less than the stack limit.

Let's see if anybody notices.  We could teach acct_stack_growth() to
allow an extra page for a grow-up/grow-down stack in the rlimit test,
but I don't want to add more complexity if it isn't needed.

Reported-and-tested-by: Jay Foad 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed

2015-01-16T14:59:03+00:00

commit 9e5e3661727eaf960d3480213f8e87c8d67b6956 upstream.

Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
stuck in a busy loop with nothing left to balance, but
kswapd_try_to_sleep() failing to sleep.  Their analysis found the cause
to be a combination of several factors:

1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

2. The process has been killed (by OOM in this case), but has not yet been
   scheduled to remove itself from the waitqueue and die.

3. kswapd checks for throttled processes in prepare_kswapd_sleep():

        if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                wake_up(&pgdat->pfmemalloc_wait);
		return false; // kswapd will not go to sleep
	}

   However, for a process that was already killed, wake_up() does not remove
   the process from the waitqueue, since try_to_wake_up() checks its state
   first and returns false when the process is no longer waiting.

4. kswapd is running on the same CPU as the only CPU that the process is
   allowed to run on (through cpus_allowed, or possibly single-cpu system).

5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
   encounters no voluntary preemption points and repeatedly fails
   prepare_kswapd_sleep(), blocking the process from running and removing
   itself from the waitqueue, which would let kswapd sleep.

So, the source of the problem is that we prevent kswapd from going to
sleep until there are processes waiting on the pfmemalloc_wait queue,
and a process waiting on a queue is guaranteed to be removed from the
queue only when it gets scheduled.  This was done to make sure that no
process is left sleeping on pfmemalloc_wait when kswapd itself goes to
sleep.

However, it isn't necessary to postpone kswapd sleep until the
pfmemalloc_wait queue actually empties.  To prevent processes from being
left sleeping, it's actually enough to guarantee that all processes
waiting on pfmemalloc_wait queue have been woken up by the time we put
kswapd to sleep.

This patch therefore fixes this issue by substituting 'wake_up' with
'wake_up_all' and removing 'return false' in the code snippet from
prepare_kswapd_sleep() above.  Note that if any process puts itself in
the queue after this waitqueue_active() check, or after the wake up
itself, it means that the process will also wake up kswapd - and since
we are under prepare_to_wait(), the wake up won't be missed.  Also we
update the comment prepare_kswapd_sleep() to hopefully more clearly
describe the races it is preventing.

Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
Signed-off-by: Vlastimil Babka 
Signed-off-by: Vladimir Davydov 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: fix swapoff hang after page migration and fork

2014-12-16T17:09:42+00:00

commit 2022b4d18a491a578218ce7a4eca8666db895a73 upstream.

I've been seeing swapoff hangs in recent testing: it's cycling around
trying unsuccessfully to find an mm for some remaining pages of swap.

I have been exercising swap and page migration more heavily recently,
and now notice a long-standing error in copy_one_pte(): it's trying to
add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing
so even when it's a migration entry or an hwpoison entry.

Which wouldn't matter much, except it adds dst_mm next to src_mm,
assuming src_mm is already on the mmlist: which may not be so.  Then if
pages are later swapped out from dst_mm, swapoff won't be able to find
where to replace them.

There's already a !non_swap_entry() test for stats: move that up before
the swap_duplicate() and the addition to mmlist.

Signed-off-by: Hugh Dickins 
Cc: Kelley Nielsen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman