linux.git/mm/huge_memory.c, branch v4.17

mm/huge_memory.c: __split_huge_page() use atomic ClearPageDirty()

2018-06-02T16:33:47+00:00

Swapping load on huge=always tmpfs (with khugepaged tuned up to be very
eager, but I'm not sure that is relevant) soon hung uninterruptibly,
waiting for page lock in shmem_getpage_gfp()'s find_lock_entry(), most
often when "cp -a" was trying to write to a smallish file.  Debug showed
that the page in question was not locked, and page->mapping NULL by now,
but page->index consistent with having been in a huge page before.

Reproduced in minutes on a 4.15 kernel, even with 4.17's 605ca5ede764
("mm/huge_memory.c: reorder operations in __split_huge_page_tail()") added
in; but took hours to reproduce on a 4.17 kernel (no idea why).

The culprit proved to be the __ClearPageDirty() on tails beyond i_size in
__split_huge_page(): the non-atomic __bitoperation may have been safe when
4.8's baa355fd3314 ("thp: file pages support for split_huge_page()")
introduced it, but liable to erase PageWaiters after 4.10's 62906027091f
("mm: add PageWaiters indicating tasks are waiting for a page bit").

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805291841070.3197@eggly.anvils
Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
Signed-off-by: Hugh Dickins 
Acked-by: Kirill A. Shutemov 
Cc: Konstantin Khlebnikov 
Cc: Nicholas Piggin 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: enable thp migration for shmem thp

2018-04-21T00:18:35+00:00

My testing for the latest kernel supporting thp migration showed an
infinite loop in offlining the memory block that is filled with shmem
thps.  We can get out of the loop with a signal, but kernel should return
with failure in this case.

What happens in the loop is that scan_movable_pages() repeats returning
the same pfn without any progress.  That's because page migration always
fails for shmem thps.

In memory offline code, memory blocks containing unmovable pages should be
prevented from being offline targets by has_unmovable_pages() inside
start_isolate_page_range().  So it's possible to change migratability for
non-anonymous thps to avoid the issue, but it introduces more complex and
thp-specific handling in migration code, so it might not good.

So this patch is suggesting to fix the issue by enabling thp migration for
shmem thp.  Both of anon/shmem thp are migratable so we don't need
precheck about the type of thps.

Link: http://lkml.kernel.org/r/20180406030706.GA2434@hori1.linux.bs1.fc.nec.co.jp
Fixes: commit 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
Signed-off-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: Zi Yan 
Cc: Vlastimil Babka 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

page cache: use xa_lock

2018-04-11T17:28:39+00:00

Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox 
Acked-by: Jeff Layton 
Cc: Darrick J. Wong 
Cc: Dave Chinner 
Cc: Ryusuke Konishi 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: unclutter THP migration

2018-04-11T17:28:32+00:00

THP migration is hacked into the generic migration with rather
surprising semantic.  The migration allocation callback is supposed to
check whether the THP can be migrated at once and if that is not the
case then it allocates a simple page to migrate.  unmap_and_move then
fixes that up by spliting the THP into small pages while moving the head
page to the newly allocated order-0 page.  Remaning pages are moved to
the LRU list by split_huge_page.  The same happens if the THP allocation
fails.  This is really ugly and error prone [1].

I also believe that split_huge_page to the LRU lists is inherently wrong
because all tail pages are not migrated.  Some callers will just work
around that by retrying (e.g.  memory hotplug).  There are other pfn
walkers which are simply broken though.  e.g. madvise_inject_error will
migrate head and then advances next pfn by the huge page size.
do_move_page_to_node_array, queue_pages_range (migrate_pages, mbind),
will simply split the THP before migration if the THP migration is not
supported then falls back to single page migration but it doesn't handle
tail pages if the THP migration path is not able to allocate a fresh THP
so we end up with ENOMEM and fail the whole migration which is a
questionable behavior.  Page compaction doesn't try to migrate large
pages so it should be immune.

This patch tries to unclutter the situation by moving the special THP
handling up to the migrate_pages layer where it actually belongs.  We
simply split the THP page into the existing list if unmap_and_move fails
with ENOMEM and retry.  So we will _always_ migrate all THP subpages and
specific migrate_pages users do not have to deal with this case in a
special way.

[1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com

Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Kirill A. Shutemov 
Reviewed-by: Zi Yan 
Cc: Andrea Reale 
Cc: Anshuman Khandual 
Cc: Mike Kravetz 
Cc: Naoya Horiguchi 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg, thp: do not invoke oom killer on thp charges

2018-04-11T17:28:31+00:00

A THP memcg charge can trigger the oom killer since 2516035499b9 ("mm,
thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
We have used an explicit __GFP_NORETRY previously which ruled the OOM
killer automagically.

Memcg charge path should be semantically compliant with the allocation
path and that means that if we do not trigger the OOM killer for costly
orders which should do the same in the memcg charge path as well.
Otherwise we are forcing callers to distinguish the two and use
different gfp masks which is both non-intuitive and bug prone.  As soon
as we get a costly high order kmalloc user we even do not have any means
to tell the memcg specific gfp mask to prevent from OOM because the
charging is deep within guts of the slab allocator.

The unexpected memcg OOM on THP has already been fixed upstream by
9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a
one-off fix rather than a generic solution.  Teach mem_cgroup_oom to
bail out on costly order requests to fix the THP issue as well as any
other costly OOM eligible allocations to be added in future.

Also revert 9d3c3354bb85 because special gfp for THP is no longer
needed.

Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Signed-off-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: "Kirill A. Shutemov" 
Cc: Vlastimil Babka 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/huge_memory.c: reorder operations in __split_huge_page_tail()

2018-04-06T04:36:25+00:00

THP split makes non-atomic change of tail page flags.  This is almost ok
because tail pages are locked and isolated but this breaks recent
changes in page locking: non-atomic operation could clear bit
PG_waiters.

As a result concurrent sequence get_page_unless_zero() -> lock_page()
might block forever.  Especially if this page was truncated later.

Fix is trivial: clone flags before unfreezing page reference counter.

This race exists since commit 62906027091f ("mm: add PageWaiters
indicating tasks are waiting for a page bit") while unsave unfreeze
itself was added in commit 8df651c7059e ("thp: cleanup
split_huge_page()").

clear_compound_head() also must be called before unfreezing page
reference because after successful get_page_unless_zero() might follow
put_page() which needs correct compound_head().

And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
is made especially for that and has semantic of smp_store_release().

Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
Signed-off-by: Konstantin Khlebnikov 
Acked-by: Kirill A. Shutemov 
Cc: Michal Hocko 
Cc: Nicholas Piggin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, thp: do not cause memcg oom for thp

2018-03-23T00:07:02+00:00

Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations") changed the page allocator to no longer detect
thp allocations based on __GFP_NORETRY.

It did not, however, modify the mem cgroup try_charge() path to avoid
oom kill for either khugepaged collapsing or thp faulting.  It is never
expected to oom kill a process to allocate a hugepage for thp; reclaim
is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
(and charging) should fallback instead of oom killing processes.

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Signed-off-by: David Rientjes 
Cc: "Kirill A. Shutemov" 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/thp: do not wait for lock_page() in deferred_split_scan()

2018-03-23T00:07:01+00:00

deferred_split_scan() gets called from reclaim path.  Waiting for page
lock may lead to deadlock there.

Replace lock_page() with trylock_page() and skip the page if we failed
to lock it.  We will get to the page on the next scan.

Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()")
Signed-off-by: Kirill A. Shutemov 
Acked-by: Michal Hocko 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/thp: remove pmd_huge_split_prepare()

2018-02-01T01:18:38+00:00

Instead of marking the pmd ready for split, invalidate the pmd.  This
should take care of powerpc requirement.  Only side effect is that we
mark the pmd invalid early.  This can result in us blocking access to
the page a bit longer if we race against a thp split.

[kirill.shutemov@linux.intel.com: rebased, dirty THP once]
Link: http://lkml.kernel.org/r/20171213105756.69879-13-kirill.shutemov@linux.intel.com
Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
Cc: Catalin Marinas 
Cc: David Daney 
Cc: David Miller 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Ingo Molnar 
Cc: Martin Schwidefsky 
Cc: Michal Hocko 
Cc: Nitin Gupta 
Cc: Ralf Baechle 
Cc: Thomas Gleixner 
Cc: Vineet Gupta 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: use updated pmdp_invalidate() interface to track dirty/accessed bits

2018-02-01T01:18:38+00:00

Use the modifed pmdp_invalidate() that returns the previous value of pmd
to transfer dirty and accessed bits.

Link: http://lkml.kernel.org/r/20171213105756.69879-12-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
Cc: Aneesh Kumar K.V 
Cc: Catalin Marinas 
Cc: David Daney 
Cc: David Miller 
Cc: H. Peter Anvin 
Cc: Hugh Dickins 
Cc: Ingo Molnar 
Cc: Martin Schwidefsky 
Cc: Michal Hocko 
Cc: Nitin Gupta 
Cc: Ralf Baechle 
Cc: Thomas Gleixner 
Cc: Vineet Gupta 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds