linux.git/mm/huge_memory.c, branch v4.1

thp: cleanup khugepaged startup

2015-04-15T23:35:19+00:00

Few trivial cleanups:

 - no need to call set_recommended_min_free_kbytes() from
   late_initcall() -- start_khugepaged() calls it;

 - no need to call set_recommended_min_free_kbytes() from
   start_khugepaged() if khugepaged is not started;

 - there isn't much point in running start_khugepaged() if we've just
   set transparent_hugepage_flags to zero;

 - start_khugepaged() is misnamed -- it also used to stop the thread;

Signed-off-by: Kirill A. Shutemov 
Cc: David Rientjes 
Cc: Andrea Arcangeli 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

thp: do not adjust zone water marks if khugepaged is not started

2015-04-15T23:35:19+00:00

set_recommended_min_free_kbytes() adjusts zone water marks to be suitable
for khugepaged. We avoid doing this if khugepaged is disabled, but don't
catch the case when khugepaged is failed to start.

Let's address this by checking khugepaged_thread instead of
khugepaged_enabled() in set_recommended_min_free_kbytes().
It's NULL if the kernel thread is stopped or failed to start.

Signed-off-by: Kirill A. Shutemov 
Cc: David Rientjes 
Cc: Andrea Arcangeli 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

thp: handle errors in hugepage_init() properly

2015-04-15T23:35:18+00:00

We miss error-handling in few cases hugepage_init(). Let's fix that.

Signed-off-by: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove rest of ACCESS_ONCE() usages

2015-04-15T23:35:18+00:00

We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
tree since it doesn't work reliably on non-scalar types.

This patch removes the rest of the usages of ACCESS_ONCE, and use the new
READ_ONCE API for the read accesses.  This makes things cleaner, instead
of using separate/multiple sets of APIs.

Signed-off-by: Jason Low 
Acked-by: Michal Hocko 
Acked-by: Davidlohr Bueso 
Acked-by: Rik van Riel 
Reviewed-by: Christian Borntraeger 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memcg: sync allocation and memcg charge gfp flags for THP

2015-04-15T23:35:17+00:00

memcg currently uses hardcoded GFP_TRANSHUGE gfp flags for all THP
charges.  THP allocations, however, might be using different flags
depending on /sys/kernel/mm/transparent_hugepage/{,khugepaged/}defrag and
the current allocation context.

The primary difference is that defrag configured to "madvise" value will
clear __GFP_WAIT flag from the core gfp mask to make the allocation
lighter for all mappings which are not backed by VM_HUGEPAGE vmas.  If
memcg charge path ignores this fact we will get light allocation but the a
potential memcg reclaim would kill the whole point of the configuration.

Fix the mismatch by providing the same gfp mask used for the allocation to
the charge functions.  This is quite easy for all paths except for
hugepaged kernel thread with !CONFIG_NUMA which is doing a pre-allocation
long before the allocated page is used in collapse_huge_page via
khugepaged_alloc_page.  To prevent from cluttering the whole code path
from khugepaged_do_scan we simply return the current flags as per
khugepaged_defrag() value which might have changed since the
preallocation.  If somebody changed the value of the knob we would charge
differently but this shouldn't happen often and it is definitely not
critical because it would only lead to a reduced success rate of one-off
THP promotion.

[akpm@linux-foundation.org: fix weird code layout while we're there]
[rientjes@google.com: clean up around alloc_hugepage_gfpmask()]
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Johannes Weiner 
Acked-by: David Rientjes 
Signed-off-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, thp: really limit transparent hugepage allocation to local node

2015-04-14T23:49:03+00:00

Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local
node") restructured alloc_hugepage_vma() with the intent of only
allocating transparent hugepages locally when there was not an effective
interleave mempolicy.

alloc_pages_exact_node() does not limit the allocation to the single node,
however, but rather prefers it.  This is because __GFP_THISNODE is not set
which would cause the node-local nodemask to be passed.  Without it, only
a nodemask that prefers the local node is passed.

Fix this by passing __GFP_THISNODE and falling back to small pages when
the allocation fails.

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
node") suffers from a similar problem for khugepaged, which is also fixed.

Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node")
Signed-off-by: David Rientjes 
Acked-by: Vlastimil Babka 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: Joonsoo Kim 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Pravin Shelar 
Cc: Jarno Rajahalme 
Cc: Li Zefan 
Cc: Greg Thelen 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: incorporate zero pages into transparent huge pages

2015-04-14T23:49:01+00:00

This patch improves THP collapse rates, by allowing zero pages.

Currently THP can collapse 4kB pages into a THP when there are up to
khugepaged_max_ptes_none pte_none ptes in a 2MB range.  This patch counts
pte none and mapped zero pages with the same variable.

The patch was tested with a program that allocates 800MB of
memory, and performs interleaved reads and writes, in a pattern
that causes some 2MB areas to first see read accesses, resulting
in the zero pfn being mapped there.

To simulate memory fragmentation at allocation time, I modified
do_huge_pmd_anonymous_page to return VM_FAULT_FALLBACK for read faults.

Without the patch, only %50 of the program was collapsed into THP and the
percentage did not increase over time.

With this patch after 10 minutes of waiting khugepaged had collapsed %99
of the program's memory.

[aarcange@redhat.com: fix bogus BUG()]
Signed-off-by: Ebru Akagunduz 
Reviewed-by: Rik van Riel 
Reviewed-by: Andrea Arcangeli 
Acked-by: Kirill A. Shutemov 
Acked-by: Vlastimil Babka 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Sasha Levin 
Cc: David Rientjes 
Cc: Mel Gorman 
Signed-off-by: Andrea Arcangeli 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: rename FOLL_MLOCK to FOLL_POPULATE

2015-04-14T23:48:59+00:00

After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
get_user_pages only for mlock") FOLL_MLOCK has lost its original
meaning: we don't necessarily mlock the page if the flags is set -- we
also take VM_LOCKED into consideration.

Since we use the same codepath for __mm_populate(), let's rename
FOLL_MLOCK to FOLL_POPULATE.

Signed-off-by: Kirill A. Shutemov 
Acked-by: Linus Torvalds 
Acked-by: David Rientjes 
Cc: Michel Lespinasse 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: numa: mark huge PTEs young when clearing NUMA hinting faults

2015-03-25T23:20:31+00:00

Base PTEs are marked young when the NUMA hinting information is cleared
but the same does not happen for huge pages which this patch addresses.

Note that migrated pages are not marked young as the base page migration
code does not assume that migrated pages have been referenced.  This
could be addressed but beyond the scope of this series which is aimed at
Dave Chinners shrink workload that is unlikely to be affected by this
issue.

Signed-off-by: Mel Gorman 
Cc: Dave Chinner 
Cc: Ingo Molnar 
Cc: Aneesh Kumar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: numa: slow PTE scan rate if migration failures occur

2015-03-25T23:20:31+00:00

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.

Signed-off-by: Mel Gorman 
Reported-by: Dave Chinner 
Tested-by: Dave Chinner 
Cc: Ingo Molnar 
Cc: Aneesh Kumar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds