linux.git/mm, branch v3.19

memcg, shmem: fix shmem migration to use lrucare

2015-02-05T21:35:29+00:00

It has been reported that 965GM might trigger

  VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage)

in mem_cgroup_migrate when shmem wants to replace a swap cache page
because of shmem_should_replace_page (the page is allocated from an
inappropriate zone).  shmem_replace_page expects that the oldpage is not
on LRU list and calls mem_cgroup_migrate without lrucare.  This is
obviously incorrect because swapcache pages might be on the LRU list
(e.g. swapin readahead page).

Fix this by enabling lrucare for the migration in shmem_replace_page.
Also clarify that lrucare should be used even if one of the pages might
be on LRU list.

The BUG_ON will trigger only when CONFIG_DEBUG_VM is enabled but even
without that the migration code might leave the old page on an
inappropriate memcg' LRU which is not that critical because the page
would get removed with its last reference but it is still confusing.

Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
Signed-off-by: Michal Hocko 
Reported-by: Chris Wilson 
Reported-by: Dave Airlie 
Acked-by: Hugh Dickins 
Acked-by: Johannes Weiner 
Cc: 	[3.17+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: export "high_memory" symbol on !MMU

2015-02-05T21:35:29+00:00

The symbol 'high_memory' is provided on both MMU- and NOMMU-kernels, but
only one of them is exported, which leads to module build errors in
drivers that work fine built-in:

  ERROR: "high_memory" [drivers/net/virtio_net.ko] undefined!
  ERROR: "high_memory" [drivers/net/ppp/ppp_mppe.ko] undefined!
  ERROR: "high_memory" [drivers/mtd/nand/nand.ko] undefined!
  ERROR: "high_memory" [crypto/tcrypt.ko] undefined!
  ERROR: "high_memory" [crypto/cts.ko] undefined!

This exports the symbol to get these to work on NOMMU as well.

Signed-off-by: Arnd Bergmann 
Cc: Kirill A. Shutemov 
Acked-by: Greg Ungerer 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range

2015-02-05T21:35:29+00:00

walk_page_range() silently skips vma having VM_PFNMAP set, which leads
to undesirable behaviour at client end (who called walk_page_range).
Userspace applications get the wrong data, so the effect is like just
confusing users (if the applications just display the data) or sometimes
killing the processes (if the applications do something with
misunderstanding virtual addresses due to the wrong data.)

For example for pagemap_read, when no callbacks are called against
VM_PFNMAP vma, pagemap_read may prepare pagemap data for next virtual
address range at wrong index.

Eventually userspace may get wrong pagemap data for a task.
Corresponding to a VM_PFNMAP marked vma region, kernel may report
mappings from subsequent vma regions.  User space in turn may account
more pages (than really are) to the task.

In my case I was using procmem, procrack (Android utility) which uses
pagemap interface to account RSS pages of a task.  Due to this bug it
was giving a wrong picture for vmas (with VM_PFNMAP set).

Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas")
Signed-off-by: Shiraz Hashim 
Acked-by: Naoya Horiguchi 
Cc: 	[3.10+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS

2015-01-29T19:15:17+00:00

The stack guard page error case has long incorrectly caused a SIGBUS
rather than a SIGSEGV, but nobody actually noticed until commit
fee7e49d4514 ("mm: propagate error from stack expansion even for guard
page") because that error case was never actually triggered in any
normal situations.

Now that we actually report the error, people noticed the wrong signal
that resulted.  So far, only the test suite of libsigsegv seems to have
actually cared, but there are real applications that use libsigsegv, so
let's not wait for any of those to break.

Reported-and-tested-by: Takashi Iwai 
Tested-by: Jan Engelhardt 
Acked-by: Heiko Carstens  # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

vm: add VM_FAULT_SIGSEGV handling support

2015-01-29T18:51:32+00:00

The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.

That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works.  However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV.  And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.

However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space.  And user space really
expected SIGSEGV, not SIGBUS.

To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it.  They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.

This is the mindless minimal patch to do this.  A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.

Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.

Reported-and-tested-by: Takashi Iwai 
Tested-by: Jan Engelhardt 
Acked-by: Heiko Carstens  # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

mm/vmscan: fix highidx argument type

2015-01-26T21:37:18+00:00

for_each_zone_zonelist_nodemask wants an enum zone_type argument, but is
passed gfp_t:

  mm/vmscan.c:2658:9:    expected int enum zone_type [signed] highest_zoneidx
  mm/vmscan.c:2658:9:    got restricted gfp_t [usertype] gfp_mask
  mm/vmscan.c:2658:9: warning: incorrect type in argument 2 (different base types)
  mm/vmscan.c:2658:9:    expected int enum zone_type [signed] highest_zoneidx
  mm/vmscan.c:2658:9:    got restricted gfp_t [usertype] gfp_mask

convert argument to the correct type.

Signed-off-by: Michael S. Tsirkin 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Rik van Riel 
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: Suleiman Souhlal 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: remove extra newlines from memcg oom kill log

2015-01-26T21:37:18+00:00

Commit e61734c55c24 ("cgroup: remove cgroup->name") added two extra
newlines to memcg oom kill log messages.  This makes dmesg hard to read
and parse.  The issue affects 3.15+.

Example:

  Task in /t                          <<< extra #1
   killed as a result of limit of /t
                                      <<< extra #2
  memory: usage 102400kB, limit 102400kB, failcnt 274712

Remove the extra newlines from memcg oom kill messages, so the messages
look like:

  Task in /t killed as a result of limit of /t
  memory: usage 102400kB, limit 102400kB, failcnt 240649

Fixes: e61734c55c24 ("cgroup: remove cgroup->name")
Signed-off-by: Greg Thelen 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page_alloc: embed OOM killing naturally into allocation slowpath

2015-01-26T21:37:18+00:00

The OOM killing invocation does a lot of duplicative checks against the
task's allocation context.  Rework it to take advantage of the existing
checks in the allocator slowpath.

The OOM killer is invoked when the allocator is unable to reclaim any
pages but the allocation has to keep looping.  Instead of having a check
for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM
invocation to the true branch of should_alloc_retry().  The __GFP_FS
check from oom_gfp_allowed() can then be moved into the OOM avoidance
branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test.

__alloc_pages_may_oom() can then signal to the caller whether the OOM
killer was invoked, instead of requiring it to duplicate the order and
high_zoneidx checks to guess this when deciding whether to continue.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: mmu_gather: use tlb->end != 0 only for TLB invalidation

2015-01-13T02:20:40+00:00

When batching up address ranges for TLB invalidation, we check tlb->end
!= 0 to indicate that some pages have actually been unmapped.

As of commit f045bbb9fa1b ("mmu_gather: fix over-eager
tlb_flush_mmu_free() calling"), we use the same check for freeing these
pages in order to avoid a performance regression where we call
free_pages_and_swap_cache even when no pages are actually queued up.

Unfortunately, the range could have been reset (tlb->end = 0) by
tlb_end_vma, which has been shown to cause memory leaks on arm64.
Furthermore, investigation into these leaks revealed that the fullmm
case on task exit no longer invalidates the TLB, by virtue of tlb->end
 == 0 (in 3.18, need_flush would have been set).

This patch resolves the problem by reverting commit f045bbb9fa1b, using
instead tlb->local.nr as the predicate for page freeing in
tlb_flush_mmu_free and ensuring that tlb->end is initialised to a
non-zero value in the fullmm case.

Tested-by: Mark Langsdorf 
Tested-by: Dave Hansen 
Signed-off-by: Will Deacon 
Signed-off-by: Linus Torvalds

mm: fix corner case in anon_vma endless growing prevention

2015-01-11T19:45:10+00:00

Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel
BUG at mm/rmap.c:399!") caused by commit 7a3ef208e662 ("mm: prevent
endless growth of anon_vma hierarchy")

Anon_vma_clone() is usually called for a copy of source vma in
destination argument.  If source vma has anon_vma it should be already
in dst->anon_vma.  NULL in dst->anon_vma is used as a sign that it's
called from anon_vma_fork().  In this case anon_vma_clone() finds
anon_vma for reusing.

Vma_adjust() calls it differently and this breaks anon_vma reusing
logic: anon_vma_clone() links vma to old anon_vma and updates degree
counters but vma_adjust() overrides vma->anon_vma right after that.  As
a result final unlink_anon_vmas() decrements degree for wrong anon_vma.

This patch assigns ->anon_vma before calling anon_vma_clone().

Signed-off-by: Konstantin Khlebnikov 
Reported-and-tested-by: Chris Clayton 
Reported-and-tested-by: Oded Gabbay 
Reported-and-tested-by: Chih-Wei Huang 
Acked-by: Rik van Riel 
Acked-by: Vlastimil Babka 
Cc: Daniel Forrest 
Cc: Michal Hocko 
Cc: stable@vger.kernel.org  # to match back-porting of 7a3ef208e662
Signed-off-by: Linus Torvalds