linux-stable.git/include/linux/mm_types.h, branch v4.3.5

mm: drop __nocast from vm_flags_t definition

2015-09-08T22:35:28+00:00

__nocast does no good for vm_flags_t. It only produces useless sparse
warnings.

Let's drop it.

Signed-off-by: Kirill A. Shutemov 
Cc: Oleg Nesterov 
Acked-by: David Rientjes 
Cc: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

x86, mm: trace when an IPI is about to be sent

2015-09-04T23:54:41+00:00

When unmapping pages it is necessary to flush the TLB.  If that page was
accessed by another CPU then an IPI is used to flush the remote CPU.  That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per
second.

There already is a window between when a page is unmapped and when it is
TLB flushed.  This series increases the window so multiple pages can be
flushed using a single IPI.  This should be safe or the kernel is hosed
already.

Patch 1 simply made the rest of the series easier to write as ftrace
        could identify all the senders of TLB flush IPIs.

Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI
        to flush the entire TLB.

Patch 3 tracks when there potentially are writable TLB entries that
        need to be batched differently

Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes

The performance impact is documented in the changelogs but in the optimistic
case on a 4-socket machine the full series reduces interrupts from 900K
interrupts/second to 60K interrupts/second.
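
As a rough illustration of why a wider window reduces interrupts, here is
a minimal userspace sketch, not kernel code.  It assumes each unmapped
page carries a bitmask of CPUs that may hold a TLB entry for it; the
names cpu_mask_t, ipis_unbatched() and ipis_batched() are hypothetical
and do not appear in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: bit n is set if CPU n may cache the page's mapping. */
typedef uint64_t cpu_mask_t;

/* Number of set bits, i.e. how many CPUs would need an IPI. */
static unsigned int mask_weight(cpu_mask_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Unbatched: every unmapped page sends one IPI per CPU that may map it. */
static unsigned long ipis_unbatched(const cpu_mask_t *page_cpus,
				    size_t nr_pages)
{
	unsigned long ipis = 0;
	size_t i;

	for (i = 0; i < nr_pages; i++)
		ipis += mask_weight(page_cpus[i]);
	return ipis;
}

/* Batched: merge the CPU masks, then flush each CPU once per batch. */
static unsigned long ipis_batched(const cpu_mask_t *page_cpus,
				  size_t nr_pages, size_t batch)
{
	unsigned long ipis = 0;
	cpu_mask_t pending = 0;
	size_t i;

	assert(batch > 0);
	for (i = 0; i < nr_pages; i++) {
		pending |= page_cpus[i];
		if ((i + 1) % batch == 0 || i + 1 == nr_pages) {
			ipis += mask_weight(pending);
			pending = 0;
		}
	}
	return ipis;
}
```

In this model a larger batch can only lower the IPI count, at the cost of
a longer delay between unmapping a page and flushing it.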

This patch (of 4):

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it.  This patch makes it easy to identify the
source of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman 
Reviewed-by: Rik van Riel 
Reviewed-by: Dave Hansen 
Acked-by: Ingo Molnar 
Cc: Linus Torvalds 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct

2015-09-04T23:54:41+00:00

This adds the vm_userfaultfd_ctx to the vm_area_struct.

Signed-off-by: Andrea Arcangeli 
Acked-by: Pavel Emelyanov 
Cc: Sanidhya Kashyap 
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" 
Cc: Andres Lagar-Cavilla 
Cc: Dave Hansen 
Cc: Paolo Bonzini 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Andy Lutomirski 
Cc: Hugh Dickins 
Cc: Peter Feiner 
Cc: "Dr. David Alan Gilbert" 
Cc: Johannes Weiner 
Cc: "Huangpeng (Peter)" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: make page pfmemalloc check more robust

2015-08-21T21:30:10+00:00

Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
checks for page->pfmemalloc to __skb_fill_page_desc():

        if (page->pfmemalloc && !page->mapping)
                skb->pfmemalloc = true;

It assumes page->mapping == NULL implies that page->pfmemalloc can be
trusted.  However, __delete_from_page_cache() can set page->mapping
to NULL and leave the page->index value alone.  Because the two fields
share a union, a non-zero page->index will be interpreted as a true
page->pfmemalloc.

So the assumption is invalid if the networking code can see such a page.
And it seems it can.  We have encountered this with an NFS over loopback
setup when such a page is attached to a new skbuf.  There is no copying
going on in this case, so the page confuses __skb_fill_page_desc(), which
interprets the index as the pfmemalloc flag.  The network stack drops
packets that have been allocated using the reserves unless they are to
be queued on sockets handling the swapping.  The drop is what happens
here, and it leads to hangs when the NFS client waits for a response
from the server which has been dropped and thus never arrives.

The struct page is already heavily packed so rather than finding another
hole to put it in, let's do a trick instead.  We can reuse the index
again but define it to an impossible value (-1UL).  This is the page
index so it should never hold a value that large.  Replace all direct
users of page->pfmemalloc by page_is_pfmemalloc which will hide this
nastiness from unspoiled eyes.

Obviously, the information will be lost if somebody wants to use
page->index, but that was the case before, and the original code expected
that the information would be persisted somewhere else if that is
really needed (e.g.  what SLAB and SLUB do).
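
The sentinel trick can be modelled in userspace as follows.  This is a
minimal sketch under stated assumptions, not the kernel implementation:
struct page_model, PFMEMALLOC_INDEX, set_page_index() and old_check() are
hypothetical stand-ins, and only the helper name page_is_pfmemalloc comes
from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct page: only the fields relevant here. */
struct page_model {
	void *mapping;
	unsigned long index;	/* page offset, or the sentinel below */
};

/* Reserved value; a real page index is assumed never to be this large. */
#define PFMEMALLOC_INDEX (-1UL)

/* Ordinary users of the index must stay clear of the reserved value. */
static void set_page_index(struct page_model *page, unsigned long index)
{
	assert(index != PFMEMALLOC_INDEX);
	page->index = index;
}

static bool page_is_pfmemalloc(const struct page_model *page)
{
	return page->index == PFMEMALLOC_INDEX;
}

static void set_page_pfmemalloc(struct page_model *page)
{
	page->index = PFMEMALLOC_INDEX;
}

static void clear_page_pfmemalloc(struct page_model *page)
{
	page->index = 0;
}

/* Model of the old check: a stale non-zero index looked like "true". */
static bool old_check(const struct page_model *page)
{
	return page->index != 0 && page->mapping == NULL;
}
```

A page removed from the page cache with a leftover index of, say, 42
passes old_check() but not page_is_pfmemalloc(), which is the false
positive described above.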

[akpm@linux-foundation.org: fix blooper in slub]
Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
Signed-off-by: Michal Hocko 
Debugged-by: Vlastimil Babka 
Debugged-by: Jiri Bohac 
Cc: Eric Dumazet 
Cc: David Miller 
Acked-by: Mel Gorman 
Cc: 	[3.6+]
Signed-off-by: Andrew Morton 

Signed-off-by: Linus Torvalds

mm/net: Rename and move page fragment handling from net/ to mm/

2015-05-12T14:39:26+00:00

This change moves the __alloc_page_frag functionality out of the networking
stack and into the page allocation portion of mm.  The idea is to help make
this maintainable by placing it with other page allocation functions.

Since we are moving it from skbuff.c to page_alloc.c I have also renamed
the basic defines and structure from netdev_alloc_cache to page_frag_cache
to reflect that this is now part of a different kernel subsystem.

I have also added a simple __free_page_frag function which can handle
freeing the frags based on the skb->head pointer.  The model for this is
based off of __free_pages since we don't actually need to deal with all of
the cases that put_page handles.  I incorporated the virt_to_head_page call
and compound_order into the function as it actually allows for a significant
size reduction by reducing code duplication.

Signed-off-by: Alexander Duyck 
Signed-off-by: David S. Miller

mm: rcu-protected get_mm_exe_file()

2015-04-17T13:04:07+00:00

This patch removes mm->mmap_sem from mm->exe_file read side.
Also it kills dup_mm_exe_file() and moves exe_file duplication into
dup_mmap() where both mmap_sems are locked.

[akpm@linux-foundation.org: fix comment typo]
Signed-off-by: Konstantin Khlebnikov 
Cc: Davidlohr Bueso 
Cc: Al Viro 
Cc: Oleg Nesterov 
Cc: "Paul E. McKenney" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: do not add nr_pmds into mm_struct if PMD is folded

2015-04-14T23:49:02+00:00

CONFIG_PGTABLE_LEVELS is now available on every architecture and we can
use it to check if we need to add nr_pmds into mm_struct.
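
A minimal userspace sketch of the idea, assuming a struct that only
mirrors the two counters involved.  struct mm_model and the default of 4
levels are assumptions for illustration; in a kernel build the value of
CONFIG_PGTABLE_LEVELS comes from the architecture's configuration.

```c
#include <assert.h>

/* Assumption for this sketch: normally provided by the build config. */
#ifndef CONFIG_PGTABLE_LEVELS
#define CONFIG_PGTABLE_LEVELS 4
#endif

static_assert(CONFIG_PGTABLE_LEVELS >= 2, "at least PGD and PTE levels");

struct mm_model {
	long nr_ptes;		/* PTE page tables, always tracked */
#if CONFIG_PGTABLE_LEVELS > 2
	long nr_pmds;		/* only present when the PMD level is real */
#endif
};

/* Readers go through a helper so callers need no #if of their own. */
static inline long mm_nr_pmds(const struct mm_model *mm)
{
#if CONFIG_PGTABLE_LEVELS > 2
	return mm->nr_pmds;
#else
	(void)mm;
	return 0;		/* PMD is folded: nothing to count */
#endif
}
```

With two page table levels the field disappears from the structure and
the helper collapses to a constant.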

Signed-off-by: Kirill A. Shutemov 
Tested-by: Guenter Roeck 
Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: "David S. Miller" 
Cc: "H. Peter Anvin" 
Cc: "James E.J. Bottomley" 
Cc: Benjamin Herrenschmidt 
Cc: Catalin Marinas 
Cc: Chris Metcalf 
Cc: David Howells 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: Ingo Molnar 
Cc: Jeff Dike 
Cc: Kirill A. Shutemov 
Cc: Koichi Yasutake 
Cc: Martin Schwidefsky 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Ralf Baechle 
Cc: Richard Weinberger 
Cc: Russell King 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: account pmd page tables to the process

2015-02-12T01:06:04+00:00

Dave noticed that an unprivileged process can allocate a significant
amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
oom-killer and memory cgroup.  The trick is to allocate a lot of PMD page
tables.  The Linux kernel doesn't account PMD tables to the process, only
PTE tables.

The use case below uses a few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
		for (i = 0; i < NR_PUD ; i++) {
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			*addr = 'x';
			munmap(addr, PMD_SIZE);
			addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
			if (addr == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}
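
The figure printed by the program follows from one 4 KiB PMD table per
1 GiB region touched.  The small helper below is a hypothetical worked
version of that arithmetic (the names are not part of the patch):

```c
#include <assert.h>
#include <limits.h>

#define PMD_TABLE_BYTES 4096UL	/* one 4 KiB page per PMD table on x86_64 */

/* KiB of PMD tables held after touching nr_pud distinct 1 GiB regions. */
static unsigned long pmd_tables_kib(unsigned long nr_pud)
{
	assert(nr_pud <= ULONG_MAX / PMD_TABLE_BYTES);	/* no overflow */
	return nr_pud * PMD_TABLE_BYTES >> 10;
}
```

For NR_PUD = 130000 this gives 520000 KiB, roughly 507 MiB, which matches
the ">500 MiB" quoted above.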

The patch addresses the issue by accounting PMD tables to the process the
same way we account PTE tables.

The main places where PMD tables are accounted are __pmd_alloc() and
free_pmd_range(). But there are a few corner cases:

 - HugeTLB can share PMD page tables. The patch handles this by accounting
   the table to all processes that share it.

 - x86 PAE pre-allocates a few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
   check on exit(2).

Accounting only happens on configurations where the PMD page table level
is present (PMD is not folded).  As with nr_ptes, we use a per-mm counter.
The counter value is used to calculate the baseline for the badness score
by the oom-killer.
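
The effect of the counter can be sketched in userspace as follows.  This
is a simplified model under stated assumptions: struct mm_counters and
the three helpers are hypothetical, and the real badness calculation has
more inputs than the ones shown.

```c
#include <assert.h>

/* Hypothetical per-process counters, all in units of pages. */
struct mm_counters {
	unsigned long rss_pages;	/* resident pages */
	unsigned long nr_ptes;		/* PTE page tables */
	unsigned long nr_pmds;		/* PMD page tables, newly accounted */
};

/* Called where a PMD table is allocated for the process. */
static void account_pmd_alloc(struct mm_counters *mm)
{
	mm->nr_pmds++;
}

/* Called where a PMD table is freed again. */
static void account_pmd_free(struct mm_counters *mm)
{
	assert(mm->nr_pmds > 0);	/* frees must pair with allocations */
	mm->nr_pmds--;
}

/* Baseline for the badness score: page tables now count as well. */
static unsigned long badness_baseline(const struct mm_counters *mm)
{
	return mm->rss_pages + mm->nr_ptes + mm->nr_pmds;
}
```

In this model the test program above would no longer score near zero,
because each PMD table it pins adds one page to its baseline.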

Signed-off-by: Kirill A. Shutemov 
Reported-by: Dave Hansen 
Cc: Hugh Dickins 
Reviewed-by: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: David Rientjes 
Tested-by: Sedat Dilek 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: add fields for compound destructor and order into struct page

2015-02-12T01:06:00+00:00

Currently, we use lru.next/lru.prev plus a cast to access or set the
destructor and order of a compound page.

Let's replace it with explicit fields in struct page.
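
A minimal sketch of the difference, assuming simplified stand-in types.
struct list_head_model, struct tail_page_model, compound_dtor_fn and the
four accessors are hypothetical; only the lru.next/lru.prev convention
and the idea of explicit fields come from the description above.

```c
#include <assert.h>
#include <stdint.h>

struct list_head_model { struct list_head_model *next, *prev; };
typedef void compound_dtor_fn(void *page);

struct tail_page_model {
	union {
		struct list_head_model lru;
		struct {	/* explicit fields sharing lru's storage */
			compound_dtor_fn *compound_dtor;
			unsigned long compound_order;
		};
	};
};

static_assert(sizeof(struct tail_page_model) == sizeof(struct list_head_model),
	      "explicit fields must not grow the structure");

/* Old style: smuggle the order through lru.prev with casts. */
static void old_set_order(struct tail_page_model *p, unsigned long order)
{
	p->lru.prev = (struct list_head_model *)(uintptr_t)order;
}

static unsigned long old_get_order(const struct tail_page_model *p)
{
	return (unsigned long)(uintptr_t)p->lru.prev;
}

/* New style: plain, typed field access. */
static void new_set_order(struct tail_page_model *p, unsigned long order)
{
	p->compound_order = order;
}

static unsigned long new_get_order(const struct tail_page_model *p)
{
	return p->compound_order;
}
```

Both styles store the same information in the same space; the explicit
fields only make the intent visible to the compiler and to readers.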

Signed-off-by: Kirill A. Shutemov 
Acked-by: Jerome Marchand 
Acked-by: Christoph Lameter 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: replace vma->shared.linear with vma->shared

2015-02-10T22:30:31+00:00

After removing vma->shared.nonlinear we have only one member of the
vma->shared union, which doesn't make much sense.

This patch drops the union and moves struct vma->shared.linear to
vma->shared.
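
The shape of the change can be sketched as below.  The member names rb
and rb_subtree_last and the *_model types are assumptions for
illustration and are not taken from this changelog.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for kernel types; assumptions for this sketch. */
struct rb_node_model { struct rb_node_model *left, *right, *parent; };
struct list_head_model { struct list_head_model *next, *prev; };

/* Before: "linear" was one of two union members. */
struct vma_before {
	union {
		struct {
			struct rb_node_model rb;
			unsigned long rb_subtree_last;
		} linear;
		struct list_head_model nonlinear;
	} shared;
};

/* After: with "nonlinear" gone, the union wrapper is dropped. */
struct vma_after {
	struct {
		struct rb_node_model rb;
		unsigned long rb_subtree_last;
	} shared;
};

/* The data keeps its layout; only the access path gets shorter. */
static_assert(offsetof(struct vma_before, shared.linear.rb_subtree_last) ==
	      offsetof(struct vma_after, shared.rb_subtree_last),
	      "field offset unchanged");
```

Callers change from vma->shared.linear.<field> to vma->shared.<field>
with no change in behaviour.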

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds