linux-stable.git/include/linux/mm.h, branch v4.11.2

mm: move mm_percpu_wq initialization earlier

2017-04-01T00:13:30+00:00

Yang Li has reported that drain_all_pages triggers a WARN_ON which means
that this function is called earlier than the mm_percpu_wq is
initialized on arm64 with CMA configured:

  WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
  Modules linked in:
  CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
  Hardware name: Freescale Layerscape 2088A RDB Board (DT)
  task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
  PC is at drain_all_pages+0x244/0x25c
  LR is at start_isolate_page_range+0x14c/0x1f0
  [...]
   drain_all_pages+0x244/0x25c
   start_isolate_page_range+0x14c/0x1f0
   alloc_contig_range+0xec/0x354
   cma_alloc+0x100/0x1fc
   dma_alloc_from_contiguous+0x3c/0x44
   atomic_pool_init+0x7c/0x208
   arm64_dma_init+0x44/0x4c
   do_one_initcall+0x38/0x128
   kernel_init_freeable+0x1a0/0x240
   kernel_init+0x10/0xfc
   ret_from_fork+0x10/0x20

Fix this by moving the whole setup_vmstat which is an initcall right now
to init_mm_internals which will be called right after the WQ subsystem
is initialized.

Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reported-by: Yang Li 
Tested-by: Yang Li 
Tested-by: Xiaolong Ye 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: Tetsuo Handa 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: convert generic code to 5-level paging

2017-03-09T19:48:47+00:00

Convert all non-architecture-specific code to 5-level paging.

It's mostly mechanical adding handling one more page table level in
places where we deal with pud_t.

Signed-off-by: Kirill A. Shutemov 
Acked-by: Michal Hocko 
Signed-off-by: Linus Torvalds

asm-generic: introduce 5level-fixup.h

2017-03-09T19:48:47+00:00

We are going to switch core MM to 5-level paging abstraction.

This is preparation step which adds 
As with 4level-fixup.h, the new header allows quickly make all
architectures compatible with 5-level paging in core MM.

In long run we would like to switch architectures to properly folded p4d
level by using , but it requires more
changes to arch-specific code.

Signed-off-by: Kirill A. Shutemov 
Acked-by: Michal Hocko 
Signed-off-by: Linus Torvalds

mm: remove shmem_mapping() shmem_zero_setup() duplicates

2017-02-25T01:46:56+00:00

Remove the prototypes for shmem_mapping() and shmem_zero_setup() from
linux/mm.h, since they are already provided in linux/shmem_fs.h.  But
shmem_fs.h must then provide the inline stub for shmem_mapping() when
CONFIG_SHMEM is not set, and a few more cfiles now need to #include it.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils
Signed-off-by: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Michal Simek 
Cc: Michael Ellerman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, madvise: fail with ENOMEM when splitting vma will hit max_map_count

2017-02-25T01:46:55+00:00

If madvise(2) advice will result in the underlying vma being split and
the number of areas mapped by the process will exceed
/proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
is temporarily unavailable.  It indicates that userspace should retry
the advice in the near future.  This is important for advice such as
MADV_DONTNEED which is often used by malloc implementations to free
memory back to the system: we really do want to free memory back when
madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas,
or mempolicies) cannot be allocated.

Encountering /proc/sys/vm/max_map_count is not a temporary failure,
however, so return ENOMEM to indicate this is a more serious issue.  A
followup patch to the man page will specify this behavior.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
Signed-off-by: David Rientjes 
Cc: Jonathan Corbet 
Cc: Johannes Weiner 
Cc: Jerome Marchand 
Cc: "Kirill A. Shutemov" 
Cc: Michael Kerrisk 
Cc: Anshuman Khandual 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd: non-cooperative: add event for memory unmaps

2017-02-25T01:46:55+00:00

When a non-cooperative userfaultfd monitor copies pages in the
background, it may encounter regions that were already unmapped.
Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
changes in the virtual memory layout.

Since there might be different uffd contexts for the affected VMAs, we
first should create a temporary representation for the unmap event for
each uffd context and then notify them one by one to the appropriate
userfault file descriptors.

The event notification occurs after the mmap_sem has been released.

[arnd@arndb.de: fix nommu build]
  Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
[mhocko@suse.com: fix nommu build]
  Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport 
Signed-off-by: Michal Hocko 
Signed-off-by: Arnd Bergmann 
Acked-by: Hillf Danton 
Cc: Andrea Arcangeli 
Cc: "Dr. David Alan Gilbert" 
Cc: Mike Kravetz 
Cc: Pavel Emelyanov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: replace FAULT_FLAG_SIZE with parameter to huge_fault

2017-02-25T01:46:54+00:00

Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has
been somewhat painful with getting the flags set and removed at the
correct locations.  More than one kernel oops was introduced due to
difficulties of getting the placement correctly.

Remove the flag values and introduce an input parameter to huge_fault
that indicates the size of the page entry.  This makes the code easier
to trace and should avoid the issues we see with the fault flags where
removal of the flag was necessary in the fallback paths.

Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang 
Tested-by: Dan Williams 
Reviewed-by: Jan Kara 
Cc: Matthew Wilcox 
Cc: Dave Hansen 
Cc: Vlastimil Babka 
Cc: Ross Zwisler 
Cc: Kirill A. Shutemov 
Cc: Nilesh Choudhury 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, x86: add support for PUD-sized transparent hugepages

2017-02-25T01:46:54+00:00

The current transparent hugepage code only supports PMDs.  This patch
adds support for transparent use of PUDs with DAX.  It does not include
support for anonymous pages.  x86 support code also added.

Most of this patch simply parallels the work that was done for huge
PMDs.  The only major difference is how the new ->pud_entry method in
mm_walk works.  The ->pmd_entry method replaces the ->pte_entry method,
whereas the ->pud_entry method works along with either ->pmd_entry or
->pte_entry.  The pagewalk code takes care of locking the PUD before
calling ->pud_walk, so handlers do not need to worry whether the PUD is
stable.

[dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()]
  Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
[dave.jiang@intel.com: native_pud_clear missing on i386 build]
  Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox 
Signed-off-by: Dave Jiang 
Tested-by: Alexander Kapshuk 
Cc: Dave Hansen 
Cc: Vlastimil Babka 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: Ross Zwisler 
Cc: Kirill A. Shutemov 
Cc: Nilesh Choudhury 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm,fs,dax: change ->pmd_fault to ->huge_fault

2017-02-25T01:46:54+00:00

Patch series "1G transparent hugepage support for device dax", v2.

The following series implements support for 1G trasparent hugepage on
x86 for device dax.  The bulk of the code was written by Mathew Wilcox a
while back supporting transparent 1G hugepage for fs DAX.  I have
forward ported the relevant bits to 4.10-rc.  The current submission has
only the necessary code to support device DAX.

Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs.  Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file.  We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].

[1]: https://lkml.org/lkml/2016/1/31/52

Comments from Nilesh @ Oracle:

There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:

processes         : 10,000
memory            :    6TB
pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
pmd @ 2M page size: 120,000 / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB

As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level.  Memory sizes will keep increasing; so
this number will keep increasing.

An argument can be made to convert the applications from process model
to thread model, but in the real world that may not be always practical.
Hopefully this helps explain the use case where this is valuable.

This patch (of 3):

In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault.  The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.

[ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
  Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
[dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
  Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox 
Signed-off-by: Dave Jiang 
Signed-off-by: Ross Zwisler 
Cc: Dave Hansen 
Cc: Vlastimil Babka 
Cc: Jan Kara 
Cc: Dan Williams 
Cc: Kirill A. Shutemov 
Cc: Nilesh Choudhury 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Thomas Gleixner 
Cc: Dave Jiang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf

2017-02-25T01:46:54+00:00

->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
  Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang 
Signed-off-by: Arnd Bergmann 
Reviewed-by: Ross Zwisler 
Cc: Theodore Ts'o 
Cc: Darrick J. Wong 
Cc: Matthew Wilcox 
Cc: Dave Hansen 
Cc: Christoph Hellwig 
Cc: Jan Kara 
Cc: Dan Williams 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds