linux-stable.git/mm/memory.c, branch v4.1.41

mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED

2016-03-09T18:15:11+00:00

[ Upstream commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 ]

pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
introduced to locklessly (but atomically) detect when a pmd is a regular
(stable) pmd or when the pmd is unstable and can infinitely transition
between pmd_none() and pmd_trans_huge() from under us, while only holding
the mmap_sem for reading (not for writing).

While holding the mmap_sem only for reading, MADV_DONTNEED can run from
under us and so before we can assume the pmd to be a regular stable pmd
we need to compare it against pmd_none() and pmd_trans_huge() in an
atomic way, with pmd_trans_unstable().  The old pmd_trans_huge() left a
tiny window for a race.
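
As a rough userspace illustration of why a single atomic snapshot matters,
the sketch below models a pmd as one atomic word.  The model_* names and the
bit encoding are hypothetical and are not the kernel's; only the idea
(load once, then run every check on the local copy) follows the text above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding of a pmd entry, for this model only. */
#define MODEL_PMD_NONE	0u		/* nothing mapped */
#define MODEL_PMD_HUGE	(1u << 0)	/* maps a transparent huge page */
#define MODEL_PMD_TABLE	(1u << 1)	/* points to a regular pte table */

typedef _Atomic uint64_t model_pmd_t;

/*
 * Racy pattern: the entry is loaded twice, so a concurrent writer
 * (think MADV_DONTNEED) may change it between the two checks.
 */
bool model_pmd_stable_racy(model_pmd_t *pmd)
{
	assert(pmd != NULL);
	if (atomic_load(pmd) == MODEL_PMD_NONE)
		return false;
	if (atomic_load(pmd) & MODEL_PMD_HUGE)
		return false;
	return true;
}

/*
 * Snapshot pattern: load the entry exactly once and run all checks on
 * the local copy, so both checks see the same value.
 */
bool model_pmd_unstable(model_pmd_t *pmd)
{
	uint64_t snap;

	assert(pmd != NULL);
	snap = atomic_load(pmd);
	return snap == MODEL_PMD_NONE || (snap & MODEL_PMD_HUGE) != 0;
}
```

A caller in this model would treat a true result from model_pmd_unstable()
as "do not walk the pte table; bail out and let the fault be retried".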

Useful applications are unlikely to notice the difference as doing
MADV_DONTNEED concurrently with a page fault would lead to undefined
behavior.

[akpm@linux-foundation.org: tidy up comment grammar/layout]
Signed-off-by: Andrea Arcangeli 
Reported-by: Kirill A. Shutemov 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Sasha Levin

mm: avoid setting up anonymous pages into file mapping

2015-08-03T16:29:19+00:00

commit 6b7339f4c31ad69c8e9c0b2859276e22cf72176d upstream.

Reading the page fault handler code, I noticed that under the right
circumstances the kernel would map anonymous pages into file mappings: if
the VMA doesn't have vm_ops->fault() and the VMA wasn't fully populated
on ->mmap(), the kernel would handle a page fault on an unpopulated pte
with do_anonymous_page().

Let's change the page fault handler to use do_anonymous_page() only on
an anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not
shared.

For file mappings without vm_ops->fault() or a shared VMA without vm_ops,
a page fault on a pte_none() entry would lead to SIGBUS.
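
The resulting decision can be sketched in userspace C roughly as follows.
The model_* types and the enum are hypothetical stand-ins, not the kernel's
structures; the branches only mirror the three rules stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of a fault on a pte_none() entry. */
enum model_fault_path {
	MODEL_PATH_ANON,	/* handled like do_anonymous_page() */
	MODEL_PATH_FILE,	/* handled through vm_ops->fault() */
	MODEL_PATH_SIGBUS,	/* no valid way to populate the pte */
};

struct model_vm_ops {
	int (*fault)(void);
};

struct model_vma {
	const struct model_vm_ops *vm_ops;
	bool shared;
};

/* Placeholder fault handler so a vm_ops with ->fault can be built. */
int model_noop_fault(void)
{
	return 0;
}

enum model_fault_path model_pick_path(const struct model_vma *vma)
{
	assert(vma != NULL);

	/* Anonymous VMA: only private ones get anonymous pages. */
	if (vma->vm_ops == NULL)
		return vma->shared ? MODEL_PATH_SIGBUS : MODEL_PATH_ANON;

	/* File mapping without a fault handler: nothing can fill it. */
	if (vma->vm_ops->fault == NULL)
		return MODEL_PATH_SIGBUS;

	return MODEL_PATH_FILE;
}
```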

Signed-off-by: Kirill A. Shutemov 
Acked-by: Oleg Nesterov 
Cc: Andrew Morton 
Cc: Willy Tarreau 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP

2015-04-15T23:35:20+00:00

This will allow a filesystem that uses VM_PFNMAP | VM_MIXEDMAP (no page
structs) to get notified when an access is a write to a read-only PFN.

This can happen if we mmap() a file, first mmap-read from it to
page-in a read-only PFN, then mmap-write to the same page.

We need this functionality to fix a DAX bug, where in the scenario above
we fail to set ctime/mtime though we modified the file.  An xfstest is
attached to this patchset that shows the failure and the fix.  (A DAX
patch will follow)

This functionality is extra important for us, because upon dirtying of a
pmem page we also want to RDMA the page to a remote cluster node.

We define a new pfn_mkwrite and do not reuse page_mkwrite because
  1 - The name ;-)
  2 - But mainly because it would take a very long and tedious
      audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
      users to make sure they do not now CRASH. For example, the current
      DAX code (which this is for) would crash.
      If we wanted to reuse page_mkwrite, we would first need to
      patch all users so they do not crash on no-page, then enable this
      patch. But even if I did that I would not sleep so well at night.
      Adding a new vector is the safest thing to do, and is not that
      expensive: an extra pointer in a static function vector per driver.
      Also the new vector is better for performance, because otherwise we
      would call all current kernel vectors just so they can:
        check-has-no-page-do-nothing and return.

No need to call it from do_shared_fault because do_wp_page is called to
change pte permissions anyway.
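
A minimal userspace sketch of the dispatch idea, assuming a simplified
operations table: the model_* structures, the times_updated flag and
model_fs_pfn_mkwrite() are hypothetical and only illustrate why a separate
vector keeps page-less mappings away from page_mkwrite handlers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_vma;

/* Hypothetical operations table with separate write-notify hooks. */
struct model_vm_ops {
	int (*page_mkwrite)(struct model_vma *vma);	/* has struct page */
	int (*pfn_mkwrite)(struct model_vma *vma);	/* raw PFN, no page */
};

struct model_file {
	bool times_updated;	/* stands in for ctime/mtime being set */
};

struct model_vma {
	const struct model_vm_ops *ops;
	struct model_file *file;
	bool has_struct_page;
	bool pte_writable;
};

/* Example filesystem hook: record that the file was modified. */
int model_fs_pfn_mkwrite(struct model_vma *vma)
{
	vma->file->times_updated = true;
	return 0;
}

/* Write fault on a read-only entry: notify, then make it writable. */
int model_write_fault(struct model_vma *vma)
{
	int ret = 0;

	assert(vma != NULL);
	if (!vma->has_struct_page) {
		/* Page-less mapping: only the new vector is consulted. */
		if (vma->ops && vma->ops->pfn_mkwrite)
			ret = vma->ops->pfn_mkwrite(vma);
	} else if (vma->ops && vma->ops->page_mkwrite) {
		ret = vma->ops->page_mkwrite(vma);
	}

	if (ret == 0)
		vma->pte_writable = true;
	return ret;
}
```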

Signed-off-by: Yigal Korman 
Signed-off-by: Boaz Harrosh 
Acked-by: Kirill A. Shutemov 
Cc: Matthew Wilcox 
Cc: Jan Kara 
Cc: Hugh Dickins 
Cc: Mel Gorman 
Cc: Dave Chinner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory: also print a_ops->readpage in print_bad_pte()

2015-04-15T23:35:20+00:00

A lot of filesystems use generic_file_mmap() and filemap_fault(), so
f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

This prints the file name, vm_ops->fault, f_op->mmap and a_ops->readpage
(which is almost always implemented and filesystem-specific).

Example:

[   23.676410] BUG: Bad page map in process sh  pte:1b7e6025 pmd:19bbd067
[   23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
[   23.677481] flags: 0x10000000000000c(referenced|uptodate)
[   23.677896] page dumped because: bad pte
[   23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma:          (null) mapping:ffff8800196426c0 index:97
[   23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage

[akpm@linux-foundation.org: use pr_alert, per Kirill]
Signed-off-by: Konstantin Khlebnikov 
Cc: Sasha Levin 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove rest of ACCESS_ONCE() usages

2015-04-15T23:35:18+00:00

We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
tree since ACCESS_ONCE doesn't work reliably on non-scalar types.

This patch removes the rest of the usages of ACCESS_ONCE, and uses the new
READ_ONCE API for the read accesses.  This makes things cleaner, instead
of using separate/multiple sets of APIs.
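
To illustrate the scalar versus non-scalar distinction, here is a rough
userspace sketch of a size-dispatching read.  MODEL_READ_ONCE and
model_read_once_size() are hypothetical names, the macro relies on GNU C
extensions (__typeof__ and statement expressions), and this is not the
kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Machine-word sizes are read with a single volatile access; anything
 * else falls back to a plain copy, which carries no single-access
 * guarantee.  The integer casts assume the object is an integer type
 * of that size.
 */
static inline void model_read_once_size(const volatile void *p, void *res,
					size_t size)
{
	assert(p != NULL && res != NULL);
	switch (size) {
	case 1: *(uint8_t *)res = *(const volatile uint8_t *)p; break;
	case 2: *(uint16_t *)res = *(const volatile uint16_t *)p; break;
	case 4: *(uint32_t *)res = *(const volatile uint32_t *)p; break;
	case 8: *(uint64_t *)res = *(const volatile uint64_t *)p; break;
	default:
		memcpy(res, (const void *)p, size);
	}
}

/* GNU C statement expression: evaluates to a copy of x. */
#define MODEL_READ_ONCE(x)						\
({									\
	__typeof__(x) model_val;					\
	model_read_once_size(&(x), &model_val, sizeof(model_val));	\
	model_val;							\
})

/* A 16-byte aggregate, used to exercise the non-scalar path. */
struct model_pair {
	uint64_t a;
	uint64_t b;
};
```

For example, `uint32_t seen = MODEL_READ_ONCE(shared_word);` reads a
(hypothetical) uint32_t variable shared_word with one volatile access,
while reading a struct model_pair goes through the copy path.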

Signed-off-by: Jason Low 
Acked-by: Michal Hocko 
Acked-by: Davidlohr Bueso 
Acked-by: Rik van Riel 
Reviewed-by: Christian Borntraeger 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: refactor do_wp_page handling of shared vma into a function

2015-04-14T23:49:03+00:00

The do_wp_page function is extremely long.  Extract the logic for
handling a page belonging to a shared vma into a function of its own.

This helps the readability of the code, without doing any functional
change in it.

Signed-off-by: Shachar Raindel 
Acked-by: Linus Torvalds 
Acked-by: Kirill A. Shutemov 
Acked-by: Rik van Riel 
Acked-by: Andi Kleen 
Acked-by: Haggai Eran 
Acked-by: Johannes Weiner 
Cc: Mel Gorman 
Cc: Matthew Wilcox 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Andrea Arcangeli 
Cc: Peter Feiner 
Cc: Michel Lespinasse 
Reviewed-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: refactor do_wp_page, extract the page copy flow

2015-04-14T23:49:03+00:00

In some cases, do_wp_page had to copy the page suffering a write fault
to a new location.  If the function logic decided to do this, it
was done by jumping with a "goto" operation to the relevant code block.
This made the code really hard to understand.  It is also against the
kernel coding style guidelines.

This patch extracts the page copy and page table update logic to a
separate function.  It also cleans up the naming, from "gotten" to
"wp_page_copy", and adds a few comments.

Signed-off-by: Shachar Raindel 
Acked-by: Linus Torvalds 
Acked-by: Kirill A. Shutemov 
Acked-by: Rik van Riel 
Acked-by: Andi Kleen 
Acked-by: Haggai Eran 
Acked-by: Johannes Weiner 
Cc: Mel Gorman 
Cc: Matthew Wilcox 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Andrea Arcangeli 
Cc: Peter Feiner 
Cc: Michel Lespinasse 
Reviewed-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: refactor do_wp_page - rewrite the unlock flow

2015-04-14T23:49:03+00:00

When do_wp_page is ending, in several cases it needs to unlock the pages
and ptls it was accessing.

Currently, this logic is "called" by using a goto jump.  This makes
following the control flow of the function harder.  Readability is
further hampered by the unlock case containing a large amount of logic
needed only in one of the 3 cases.

Using goto for cleanup is generally allowed.  However, moving the
trivial unlocking flows to the relevant call sites allows deeper
refactoring in the next patch.

Signed-off-by: Shachar Raindel 
Acked-by: Linus Torvalds 
Acked-by: Kirill A. Shutemov 
Acked-by: Rik van Riel 
Acked-by: Andi Kleen 
Acked-by: Haggai Eran 
Acked-by: Johannes Weiner 
Cc: Mel Gorman 
Cc: Matthew Wilcox 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Andrea Arcangeli 
Cc: Peter Feiner 
Cc: Michel Lespinasse 
Reviewed-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: refactor do_wp_page, extract the reuse case

2015-04-14T23:49:03+00:00

Currently do_wp_page contains 265 code lines.  It also contains 9 goto
statements, of which 5 are targeting labels which are not cleanup
related.  This makes the function extremely difficult to understand.

The following patches are an attempt at breaking the function into its
basic components, and making it easier to understand.

The patches are straightforward function extractions from do_wp_page.
As we extract functions, we remove unneeded parameters and simplify the
code as much as possible.  However, the functionality is supposed to
remain completely unchanged.  The patches also attempt to document the
functionality of each extracted function.  In patch 2, we split the
unlock logic so that each use case contains only the logic relevant to
its specific needs, instead of having a huge number of conditional
decisions in a single unlock flow.

This patch (of 4):

When do_wp_page is ending, in several cases it needs to reuse the existing
page.  This is achieved by making the page table writable, and possibly
updating the page-cache state.

Currently, this logic is "called" by using a goto jump.  This makes
following the control flow of the function harder.  It is also against the
coding style guidelines for using goto.

As the code can easily be refactored into a specialized function, refactor
it out and simplify the code flow in do_wp_page.
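
The shape of this refactoring can be sketched in userspace C as below.
The model_* names, the pte fields and the MODEL_FAULT_WRITE value are
hypothetical; the copy path is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page table entry. */
struct model_pte {
	bool writable;
	bool dirty;
	bool young;
};

enum { MODEL_FAULT_WRITE = 1 };

/*
 * Extracted helper: everything that used to live behind the shared
 * "reuse" label now sits in one function with a clear contract.
 */
int model_wp_page_reuse(struct model_pte *pte)
{
	assert(pte != NULL);
	pte->young = true;
	pte->dirty = true;
	pte->writable = true;
	return MODEL_FAULT_WRITE;
}

/*
 * Caller: each case that previously jumped to the label returns the
 * helper's result directly, so control flow reads top to bottom.
 */
int model_do_wp_page(struct model_pte *pte, bool anon_exclusive,
		     bool shared_writable)
{
	if (anon_exclusive)
		return model_wp_page_reuse(pte);
	if (shared_writable)
		return model_wp_page_reuse(pte);
	return 0;	/* copy path, not modeled in this sketch */
}
```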

Acked-by: Linus Torvalds 
Acked-by: Kirill A. Shutemov 
Acked-by: Rik van Riel 
Acked-by: Andi Kleen 
Acked-by: Haggai Eran 
Acked-by: Johannes Weiner 
Cc: Mel Gorman 
Cc: Matthew Wilcox 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Andrea Arcangeli 
Cc: Peter Feiner 
Cc: Michel Lespinasse 
Reviewed-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: numa: slow PTE scan rate if migration failures occur

2015-03-25T23:20:31+00:00

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.
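
A rough userspace model of the throttling idea follows.  The model_task
fields and the "double the period, up to a cap" policy are assumptions
made for illustration and are not taken from this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task NUMA scan state. */
struct model_task {
	unsigned int scan_period_ms;		/* time between PTE scans */
	unsigned int scan_period_max_ms;	/* upper bound for the period */
	unsigned int migrate_failures;		/* failures in this window */
};

/* Called for each NUMA hinting fault in this model. */
void model_note_fault(struct model_task *t, bool migrated_ok)
{
	assert(t != NULL);
	if (!migrated_ok)
		t->migrate_failures++;
}

/*
 * Called at the end of a scan window.  If any migration failed, slow
 * the scanner down instead of letting remote faults speed it up.
 */
void model_update_scan_period(struct model_task *t)
{
	assert(t != NULL);
	if (t->migrate_failures) {
		unsigned int slower = t->scan_period_ms * 2;

		if (slower > t->scan_period_max_ms)
			slower = t->scan_period_max_ms;
		t->scan_period_ms = slower;
	}
	t->migrate_failures = 0;	/* start a fresh window */
}
```

In this model a longer scan period means fewer hinting faults are
trapped while migrations are known to be failing.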

Signed-off-by: Mel Gorman 
Reported-by: Dave Chinner 
Tested-by: Dave Chinner 
Cc: Ingo Molnar 
Cc: Aneesh Kumar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds