linux-stable.git/include/linux/mm.h, branch v4.0.7

mm: allow page fault handlers to perform the COW

2015-02-17T01:56:03+00:00

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page.  It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

The handler cannot insert the page itself if there is already a read-only
mapping at that address, so allow the handler to return VM_FAULT_LOCKED
and set the fault_page to be NULL.  This indicates to the MM code that the
i_mmap_lock is held instead of the page lock.

Signed-off-by: Matthew Wilcox 
Acked-by: Kirill A. Shutemov 
Cc: Andreas Dilger 
Cc: Boaz Harrosh 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Jan Kara 
Cc: Jens Axboe 
Cc: Mathieu Desnoyers 
Cc: Randy Dunlap 
Cc: Ross Zwisler 
Cc: Theodore Ts'o 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fs/proc/task_mmu.c: add user-space support for resetting mm->hiwater_rss (peak RSS)

2015-02-13T02:54:12+00:00

Peak resident size of a process can be reset back to the process's
current rss value by writing "5" to /proc/pid/clear_refs.  The driving
use-case for this would be getting the peak RSS value, which can be
retrieved from the VmHWM field in /proc/pid/status, per benchmark
iteration or test scenario.

[akpm@linux-foundation.org: clarify behaviour in documentation]
Signed-off-by: Petr Cermak 
Cc: Bjorn Helgaas 
Cc: Primiano Tucci 
Cc: Petr Cermak 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: do not use mm->nr_pmds on !MMU configurations

2015-02-13T02:54:10+00:00

mm->nr_pmds doesn't make sense on !MMU configurations

Signed-off-by: Kirill A. Shutemov 
Cc: Guenter Roeck 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vmscan: per memory cgroup slab shrinkers

2015-02-13T02:54:09+00:00

This patch adds SHRINKER_MEMCG_AWARE flag.  If a shrinker has this flag
set, it will be called per memory cgroup.  The memory cgroup to scan
objects from is passed in shrink_control->memcg.  If the memory cgroup
is NULL, a memcg aware shrinker is supposed to scan objects from the
global list.  Unaware shrinkers are only called on global pressure with
memcg=NULL.

Signed-off-by: Vladimir Davydov 
Cc: Dave Chinner 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Greg Thelen 
Cc: Glauber Costa 
Cc: Alexander Viro 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

pagewalk: add walk_page_vma()

2015-02-12T01:06:05+00:00

Introduce walk_page_vma(), which is useful for the callers which want to
walk over a given vma.  It's used by later patches.

Signed-off-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Cyrill Gorcunov 
Cc: Dave Hansen 
Cc: Pavel Emelyanov 
Cc: Benjamin Herrenschmidt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

pagewalk: improve vma handling

2015-02-12T01:06:05+00:00

Current implementation of page table walker has a fundamental problem in
vma handling, which started when we tried to handle vma(VM_HUGETLB).
Because it's done in pgd loop, considering vma boundary makes code
complicated and bug-prone.

From the users viewpoint, some user checks some vma-related condition to
determine whether the user really does page walk over the vma.

In order to solve these, this patch moves vma check outside pgd loop and
introduce a new callback ->test_walk().

Signed-off-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Cyrill Gorcunov 
Cc: Dave Hansen 
Cc: Pavel Emelyanov 
Cc: Benjamin Herrenschmidt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/pagewalk: remove pgd_entry() and pud_entry()

2015-02-12T01:06:05+00:00

Currently no user of page table walker sets ->pgd_entry() or
->pud_entry(), so checking their existence in each loop is just wasting
CPU cycle.  So let's remove it to reduce overhead.

Signed-off-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
Cc: Cyrill Gorcunov 
Cc: Dave Hansen 
Cc: Kirill A. Shutemov 
Cc: Pavel Emelyanov 
Cc: Benjamin Herrenschmidt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: gup: add __get_user_pages_unlocked to customize gup_flags

2015-02-12T01:06:05+00:00

Some callers (like KVM) may want to set the gup_flags like FOLL_HWPOSION
to get a proper -EHWPOSION retval instead of -EFAULT to take a more
appropriate action if get_user_pages runs into a memory failure.

Signed-off-by: Andrea Arcangeli 
Reviewed-by: Kirill A. Shutemov 
Cc: Andres Lagar-Cavilla 
Cc: Peter Feiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: gup: add get_user_pages_locked and get_user_pages_unlocked

2015-02-12T01:06:05+00:00

FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
reading to reduce the mmap_sem contention (for writing), like while
waiting for I/O completion.  The problem is that right now practically no
get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging
that nifty feature.

Andres fixed it for the KVM page fault.  However get_user_pages_fast
remains uncovered, and 99% of other get_user_pages aren't using it either
(the only exception being FOLL_NOWAIT in KVM which is really nonblocking
and in fact it doesn't even release the mmap_sem).

So this patchsets extends the optimization Andres did in the KVM page
fault to the whole kernel.  It makes most important places (including
gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times
during I/O.

The only few places that remains uncovered are drivers like v4l and other
exceptions that tends to work on their own memory and they're not working
on random user memory (for example like O_DIRECT that uses gup_fast and is
fully covered by this patch).

A follow up patch should probably also add a printk_once warning to
get_user_pages that should go obsolete and be phased out eventually.  The
"vmas" parameter of get_user_pages makes it fundamentally incompatible
with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the
mmap_sem is released).

While this is just an optimization, this becomes an absolute requirement
for the userfaultfd feature http://lwn.net/Articles/615086/ .

The userfaultfd allows to block the page fault, and in order to do so I
need to drop the mmap_sem first.  So this patch also ensures that all
memory where userfaultfd could be registered by KVM, the very first fault
(no matter if it is a regular page fault, or a get_user_pages) always has
FAULT_FOLL_ALLOW_RETRY set.  Then the userfaultfd blocks and it is waken
only when the pagetable is already mapped.  The second fault attempt after
the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry
without it.

This patch (of 5):

We can leverage the VM_FAULT_RETRY functionality in the page fault paths
better by using either get_user_pages_locked or get_user_pages_unlocked.

The former allows conversion of get_user_pages invocations that will have
to pass a "&locked" parameter to know if the mmap_sem was dropped during
the call.  Example from:

    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

The latter is suitable only as a drop in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

Where tsk, mm, the intermediate "..." paramters and "pages" can be any
value as before.  Just the last parameter of get_user_pages (vmas) must be
NULL for get_user_pages_locked|unlocked to be usable (the latter original
form wouldn't have been safe anyway if vmas wasn't null, for the former we
just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

Signed-off-by: Andrea Arcangeli 
Reviewed-by: Andres Lagar-Cavilla 
Reviewed-by: Peter Feiner 
Reviewed-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: account pmd page tables to the process

2015-02-12T01:06:04+00:00

Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
kernel doesn't account PMD tables to the process, only PTE.

The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.

	#include 
	#include 
	#include 
	#include 
	#include 
	#include 

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		prctl(PR_SET_THP_DISABLE);
		for (i = 0; i < NR_PUD ; i++) {
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			*addr = 'x';
			munmap(addr, PMD_SIZE);
			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
			if (addr == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}

The patch addresses the issue by account PMD tables to the process the
same way we account PTE.

The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:

 - HugeTLB can share PMD page tables. The patch handles by accounting
   the table to all processes who share it.

 - x86 PAE pre-allocates few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
   check on exit(2).

Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
counter value is used to calculate baseline for badness score by
oom-killer.

Signed-off-by: Kirill A. Shutemov 
Reported-by: Dave Hansen 
Cc: Hugh Dickins 
Reviewed-by: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: David Rientjes 
Tested-by: Sedat Dilek 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds