linux.git/mm/internal.h, branch v4.5

mm: polish virtual memory accounting

2016-02-03T16:28:43+00:00

* add VM_STACK as alias for VM_GROWSUP/DOWN depending on architecture
* always account VMAs with flag VM_STACK as stack (as it was before)
* cleanup classifying helpers
* update comments and documentation

Signed-off-by: Konstantin Khlebnikov 
Tested-by: Sudip Mukherjee 
Cc: Cyrill Gorcunov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: warn about VmData over RLIMIT_DATA

2016-02-03T16:28:43+00:00

This patch provides a way of working around a slight regression
introduced by commit 84638335900f ("mm: rework virtual memory
accounting").

Before that commit RLIMIT_DATA have control only over size of the brk
region.  But that change have caused problems with all existing versions
of valgrind, because it set RLIMIT_DATA to zero.

This patch fixes rlimit check (limit actually in bytes, not pages) and
by default turns it into warning which prints at first VmData misuse:

  "mmap: top (795): VmData 516096 exceed data ulimit 512000.  Will be forbidden soon."

Behavior is controlled by boot param ignore_rlimit_data=y/n and by sysfs
/sys/module/kernel/parameters/ignore_rlimit_data.  For now it set to "y".

[akpm@linux-foundation.org: tweak kernel-parameters.txt text[
Signed-off-by: Konstantin Khlebnikov 
Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
Reported-by: Christian Borntraeger 
Cc: Cyrill Gorcunov 
Cc: Linus Torvalds 
Cc: Vegard Nossum 
Cc: Peter Zijlstra 
Cc: Vladimir Davydov 
Cc: Andy Lutomirski 
Cc: Quentin Casasnovas 
Cc: Kees Cook 
Cc: Willy Tarreau 
Cc: Pavel Emelyanov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

thp: reintroduce split_huge_page()

2016-01-16T01:56:32+00:00

This patch adds implementation of split_huge_page() for new
refcountings.

Unlike previous implementation, new split_huge_page() can fail if
somebody holds GUP pin on the page.  It also means that pin on page
would prevent it from bening split under you.  It makes situation in
many places much cleaner.

The basic scheme of split_huge_page():

  - Check that sum of mapcounts of all subpage is equal to page_count()
    plus one (caller pin). Foll off with -EBUSY. This way we can avoid
    useless PMD-splits.

  - Freeze the page counters by splitting all PMD and setup migration
    PTEs.

  - Re-check sum of mapcounts against page_count(). Page's counts are
    stable now. -EBUSY if page is pinned.

  - Split compound page.

  - Unfreeze the page by removing migration entries.

Signed-off-by: Kirill A. Shutemov 
Tested-by: Sasha Levin 
Tested-by: Aneesh Kumar K.V 
Acked-by: Jerome Marchand 
Cc: Vlastimil Babka 
Cc: Andrea Arcangeli 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Naoya Horiguchi 
Cc: Steve Capper 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Christoph Lameter 
Cc: David Rientjes 

Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: drop tail page refcounting

2016-01-16T01:56:32+00:00

Tail page refcounting is utterly complicated and painful to support.

It uses ->_mapcount on tail pages to store how many times this page is
pinned.  get_page() bumps ->_mapcount on tail page in addition to
->_count on head.  This information is required by split_huge_page() to
be able to distribute pins from head of compound page to tails during
the split.

We will need ->_mapcount to account PTE mappings of subpages of the
compound page.  We eliminate need in current meaning of ->_mapcount in
tail pages by forbidding split entirely if the page is pinned.

The only user of tail page refcounting is THP which is marked BROKEN for
now.

Let's drop all this mess.  It makes get_page() and put_page() much
simpler.

Signed-off-by: Kirill A. Shutemov 
Tested-by: Sasha Levin 
Tested-by: Aneesh Kumar K.V 
Acked-by: Vlastimil Babka 
Acked-by: Jerome Marchand 
Cc: Andrea Arcangeli 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Naoya Horiguchi 
Cc: Steve Capper 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Christoph Lameter 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: use 'unsigned int' for page order

2015-11-07T01:50:42+00:00

Let's try to be consistent about data type of page order.

[sfr@canb.auug.org.au: fix build (type of pageblock_order)]
[hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
Signed-off-by: Kirill A. Shutemov 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Reviewed-by: Andrea Arcangeli 
Cc: "Paul E. McKenney" 
Cc: Andi Kleen 
Cc: Aneesh Kumar K.V 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Sergey Senozhatsky 
Signed-off-by: Stephen Rothwell 
Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: make compound_head() robust

2015-11-07T01:50:42+00:00

Hugh has pointed that compound_head() call can be unsafe in some
context. There's one example:

	CPU0					CPU1

isolate_migratepages_block()
  page_count()
    compound_head()
      !!PageTail() == true
					put_page()
					  tail->first_page = NULL
      head = tail->first_page
					alloc_pages(__GFP_COMP)
					   prep_compound_page()
					     tail->first_page = head
					     __SetPageTail(p);
      !!PageTail() == true
    

The race is pure theoretical. I don't it's possible to trigger it in
practice. But who knows.

We can fix the race by changing how encode PageTail() and compound_head()
within struct page to be able to update them in one shot.

The patch introduces page->compound_head into third double word block in
front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
the rest bits are pointer to head page if bit zero is set.

The patch moves page->pmd_huge_pte out of word, just in case if an
architecture defines pgtable_t into something what can have the bit 0
set.

hugetlb_cgroup uses page->lru.next in the second tail page to store
pointer struct hugetlb_cgroup. The patch switch it to use page->private
in the second tail page instead. The space is free since ->first_page is
removed from the union.

The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
limitation, since there's now space in first tail page to store struct
hugetlb_cgroup pointer. But that's out of scope of the patch.

That means page->compound_head shares storage space with:

 - page->lru.next;
 - page->next;
 - page->rcu_head.next;

That's too long list to be absolutely sure, but looks like nobody uses
bit 0 of the word.

page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
call_rcu_lazy() is not allowed as it makes use of the bit and we can
get false positive PageTail().

[1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

Signed-off-by: Kirill A. Shutemov 
Acked-by: Michal Hocko 
Reviewed-by: Andrea Arcangeli 
Cc: Hugh Dickins 
Cc: David Rientjes 
Cc: Vlastimil Babka 
Acked-by: Paul E. McKenney 
Cc: Aneesh Kumar K.V 
Cc: Andi Kleen 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Sergey Senozhatsky 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page_alloc: hide some GFP internals and document the bits and flag combinations

2015-11-07T01:50:42+00:00

Andrew stated the following

	We have quite a history of remote parts of the kernel using
	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
	to think that this is because we didn't adequately explain the
	interface.

	And I don't think that gfp.h really improved much in this area as
	a result of this patchset.  Could you go through it some time and
	decide if we've adequately documented all this stuff?

This patches first moves some GFP flag combinations that are part of the MM
internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
bits under various headings and then documents the flag combinations. It
will not help callers that are brain damaged but the clarity might motivate
some fixes and avoid future mistakes.

Signed-off-by: Mel Gorman 
Cc: Johannes Weiner 
Cc: Rik van Riel 
Cc: Vlastimil Babka 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Michal Hocko 
Cc: Vitaly Wool 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, page_alloc: remove unnecessary recalculations for dirty zone balancing

2015-11-07T01:50:42+00:00

File-backed pages that will be immediately written are balanced between
zones.  This heuristic tries to avoid having a single zone filled with
recently dirtied pages but the checks are unnecessarily expensive.  Move
consider_zone_balanced into the alloc_context instead of checking bitmaps
multiple times.  The patch also gives the parameter a more meaningful
name.

Signed-off-by: Mel Gorman 
Acked-by: David Rientjes 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Acked-by: Johannes Weiner 
Cc: Christoph Lameter 
Cc: Vitaly Wool 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page migration fix PageMlocked on migrated pages

2015-11-06T03:34:48+00:00

Commit e6c509f85455 ("mm: use clear_page_mlock() in page_remove_rmap()")
in v3.7 inadvertently made mlock_migrate_page() impotent: page migration
unmaps the page from userspace before migrating, and that commit clears
PageMlocked on the final unmap, leaving mlock_migrate_page() with
nothing to do.  Not a serious bug, the next attempt at reclaiming the
page would fix it up; but a betrayal of page migration's intent - the
new page ought to emerge as PageMlocked.

I don't see how to fix it for mlock_migrate_page() itself; but easily
fixed in remove_migration_pte(), by calling mlock_vma_page() when the vma
is VM_LOCKED - under pte lock as in try_to_unmap_one().

Delete mlock_migrate_page()?  Not quite, it does still serve a purpose for
migrate_misplaced_transhuge_page(): where we could replace it by a test,
clear_page_mlock(), mlock_vma_page() sequence; but would that be an
improvement?  mlock_migrate_page() is fairly lean, and let's make it
leaner by skipping the irq save/restore now clearly not needed.

Signed-off-by: Hugh Dickins 
Cc: Christoph Lameter 
Cc: "Kirill A. Shutemov" 
Cc: Rik van Riel 
Acked-by: Vlastimil Babka 
Cc: Davidlohr Bueso 
Cc: Oleg Nesterov 
Cc: Sasha Levin 
Cc: Dmitry Vyukov 
Cc: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/compaction: correct to flush migrated pages if pageblock skip happens

2015-09-08T22:35:28+00:00

We cache isolate_start_pfn before entering isolate_migratepages().  If
pageblock is skipped in isolate_migratepages() due to whatever reason,
cc->migrate_pfn can be far from isolate_start_pfn hence we flush pages
that were freed.  For example, the following scenario can be possible:

- assume order-9 compaction, pageblock order is 9
- start_isolate_pfn is 0x200
- isolate_migratepages()
  - skip a number of pageblocks
  - start to isolate from pfn 0x600
  - cc->migrate_pfn = 0x620
  - return
- last_migrated_pfn is set to 0x200
- check flushing condition
  - current_block_start is set to 0x600
  - last_migrated_pfn < current_block_start then do useless flush

This wrong flush would not help the performance and success rate so this
patch tries to fix it.  One simple way to know the exact position where
we start to isolate migratable pages is that we cache it in
isolate_migratepages() before entering actual isolation.  This patch
implements that and fixes the problem.

Signed-off-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: David Rientjes 
Cc: Minchan Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds