linux-stable.git/mm/huge_memory.c, branch linux-4.3.y

thp: use is_zero_pfn() only after pte_present() check

2015-10-23T08:55:10+00:00

Use is_zero_pfn() on pteval only after a pte_present() check on pteval.
(It might be a better idea to introduce is_zero_pte(), which checks
pte_present() first.)

Otherwise, when working on a swap or migration entry, if pte_pfn()'s
result happens to equal zero_pfn, we lose the user's data in
__collapse_huge_page_copy().  If you're unlucky, the application
segfaults and you may finally see the message below on exit:

BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3
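
As an illustration of the ordering problem described above, here is a
minimal user-space sketch. It does not use the kernel's real PTE
encoding: the toy_* names, the bit layout, and the TOY_ZERO_PFN value
are all hypothetical and exist only to show why the present check has to
come first. The helper toy_is_zero_pte() mirrors the is_zero_pte() idea
floated in the message; it is not claimed to be the code of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy PTE layout (an assumption, NOT a real architecture's layout):
 * bit 0 is "present"; the bits from TOY_PFN_SHIFT up hold a PFN when
 * the entry is present, or part of an encoded swap/migration entry
 * when it is not.
 */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PTE_PRESENT 0x1ULL
#define TOY_PFN_SHIFT   12
#define TOY_ZERO_PFN    0x1234ULL /* hypothetical PFN of the zero page */

static bool toy_pte_present(toy_pte_t pte)
{
	return (pte.val & TOY_PTE_PRESENT) != 0;
}

static uint64_t toy_pte_pfn(toy_pte_t pte)
{
	return pte.val >> TOY_PFN_SHIFT;
}

static bool toy_is_zero_pfn(uint64_t pfn)
{
	return pfn == TOY_ZERO_PFN;
}

/* Unsafe: reads the bits of a non-present entry as if they were a PFN. */
static bool toy_is_zero_pte_unchecked(toy_pte_t pte)
{
	return toy_is_zero_pfn(toy_pte_pfn(pte));
}

/* Safe: only look at the PFN once the entry is known to be present. */
static bool toy_is_zero_pte(toy_pte_t pte)
{
	return toy_pte_present(pte) && toy_is_zero_pfn(toy_pte_pfn(pte));
}
```

With this layout, a non-present entry whose upper bits happen to equal
TOY_ZERO_PFN is reported as a zero page by the unchecked helper but not
by the checked one.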

Fixes: ca0984caa823 ("mm: incorporate zero pages into transparent huge pages")
Signed-off-by: Minchan Kim 
Reviewed-by: Andrea Arcangeli 
Acked-by: Kirill A. Shutemov 
Cc: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Hugh Dickins 
Cc: Rik van Riel 
Cc: 	[4.1+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: introduce idle page tracking

2015-09-10T20:29:01+00:00

Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g.  by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.  However,
this method has two serious shortcomings:

 - it does not count unmapped file pages
 - it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting a bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g.  by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.
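
To make "the offset corresponding to the page" concrete, the sketch
below computes where a page's bit would live in the bitmap file. The
layout used here is an assumption taken from the kernel's idle page
tracking documentation rather than from this message: an array of
8-byte words in native byte order, with PFN i stored in bit i % 64 of
word i / 64. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Position of one page's bit inside the bitmap file (assumed layout). */
struct idle_bit_pos {
	uint64_t byte_offset; /* file offset of the 8-byte word */
	uint64_t mask;        /* bit to set or test within that word */
};

static struct idle_bit_pos idle_bit_for_pfn(uint64_t pfn)
{
	struct idle_bit_pos pos;

	/* One 64-bit word covers 64 consecutive PFNs. */
	pos.byte_offset = (pfn / 64) * sizeof(uint64_t);
	pos.mask = 1ULL << (pfn % 64);
	return pos;
}
```

A tool would seek to byte_offset, write a word containing mask to mark
the page idle, and later read the same word back and test mask to see
whether the page is still idle.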

The Young page flag is used to avoid interference with the memory
reclaimer.  A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.
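
The interplay of the two flags can be modelled in a few lines of
user-space C. This is a simplified sketch, not the kernel code: it
assumes a page mapped by exactly one PTE, and the struct and toy_*
function names are hypothetical.

```c
#include <stdbool.h>

struct toy_page {
	bool idle;         /* Idle page flag */
	bool young;        /* Young page flag */
	bool pte_accessed; /* Access bit of the single PTE mapping the page */
};

/* The workload touches the page through its mapping. */
static void toy_touch(struct toy_page *p)
{
	p->pte_accessed = true;
}

/* Effect of setting the page's bit in the bitmap file. */
static void toy_mark_idle(struct toy_page *p)
{
	if (p->pte_accessed) {
		/* Clearing the Access bit is recorded in Young. */
		p->pte_accessed = false;
		p->young = true;
	}
	p->idle = true;
}

/* Reclaimer-side reference check, modelled on the description above. */
static int toy_page_referenced(struct toy_page *p)
{
	int referenced = 0;

	if (p->pte_accessed) {
		p->pte_accessed = false;
		referenced++;
		p->idle = false; /* a real access ends idleness */
	}
	if (p->young) {
		/* Make up for the Access bit cleared via the bitmap. */
		p->young = false;
		referenced++;
	}
	return referenced;
}
```

In this model, a page that was accessed before being marked idle still
reports one reference to the reclaimer, so writing to the bitmap does
not make it look colder than it is.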

Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov 
Reviewed-by: Andres Lagar-Cavilla 
Cc: Minchan Kim 
Cc: Raghavendra K T 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Greg Thelen 
Cc: Michel Lespinasse 
Cc: David Rientjes 
Cc: Pavel Emelyanov 
Cc: Cyrill Gorcunov 
Cc: Jonathan Corbet 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/khugepaged: allow interruption of allocation sleep again

2015-09-08T22:35:28+00:00

Commit 1dfb059b9438 ("thp: reduce khugepaged freezing latency") fixed
khugepaged so that it does not block system suspend.  As a result,
however, it can no longer be interrupted before the given timeout
expires, because the condition for the wait event is "false".

This patch puts back the original approach but it uses
freezable_schedule_timeout_interruptible() instead of
schedule_timeout_interruptible().  It does the right thing.  I am pretty
sure that the freezable variant was not used in the original fix only
because it was not available at that time.
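
The difference between the two sleeping styles can be shown with a small
tick-based model. This is a user-space simulation under simplifying
assumptions, not kernel code: time is counted in abstract ticks, at most
one wakeup arrives, and the toy_* names are hypothetical.

```c
#include <stdbool.h>

/*
 * Return how many ticks a sleeper stays asleep. wakeup_at is the tick
 * at which a wakeup arrives, or -1 if none does. If honour_wakeup is
 * false, a wakeup only causes the (always false) condition to be
 * re-checked, so the sleeper goes straight back to sleep.
 */
static long toy_sleep(long timeout, long wakeup_at, bool honour_wakeup)
{
	long t;

	for (t = 0; t < timeout; t++) {
		if (t == wakeup_at && honour_wakeup)
			break;
	}
	return t;
}

/* Models a wait-event style sleep whose condition is the constant false. */
static long toy_wait_event_false(long timeout, long wakeup_at)
{
	return toy_sleep(timeout, wakeup_at, false);
}

/* Models an interruptible timed sleep that ends on the first wakeup. */
static long toy_sleep_interruptible(long timeout, long wakeup_at)
{
	return toy_sleep(timeout, wakeup_at, true);
}
```

With a timeout of 100 ticks and a wakeup at tick 10, the first variant
sleeps the full 100 ticks while the second returns after 10.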

The regression has been there for ages.  It was not critical.  It just
did the allocation throttling a little bit more aggressively.

I found this problem when converting the kthread to kthread worker API
and trying to understand the code.

This bug is thought to have minimal userspace-visible impact.  Somebody
could set a high alloc_sleep value by mistake, and then try to fix it
back, but khugepaged would keep sleeping until the high value expires.

Signed-off-by: Petr Mladek 
Cc: Andrea Arcangeli 
Acked-by: Vlastimil Babka 
Cc: "Aneesh Kumar K.V" 
Cc: "Kirill A. Shutemov" 
Cc: David Rientjes 
Cc: Ebru Akagunduz 
Cc: Mel Gorman 
Cc: Jiri Kosina 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: rename alloc_pages_exact_node() to __alloc_pages_node()

2015-09-08T22:35:28+00:00

alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
allocator: do not check NUMA node ID when the caller knows the node is
valid") as an optimized variant of alloc_pages_node(), that doesn't
fall back to the current node for nid == NUMA_NO_NODE.  Unfortunately the
name of the function can easily suggest that the allocation is
restricted to the given node and fails otherwise.  In truth, the node is
only preferred, unless __GFP_THISNODE is passed among the gfp flags.
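
The "preferred, not exact" semantics can be sketched with a toy
allocator. Everything here is a user-space assumption made for
illustration: TOY_GFP_THISNODE stands in for __GFP_THISNODE, nodes are
tried in simple round-robin order rather than along a real zonelist,
and the function name is hypothetical.

```c
#define TOY_NR_NODES     4
#define TOY_GFP_THISNODE 0x1u /* hypothetical stand-in for __GFP_THISNODE */
#define TOY_ALLOC_FAILED (-1)

/*
 * Take one page, starting at node nid. Returns the node the page came
 * from, or TOY_ALLOC_FAILED. free_pages[] holds the free page count of
 * each node and is updated on success.
 */
static int toy_alloc_page_node(int free_pages[TOY_NR_NODES], int nid,
			       unsigned int flags)
{
	int i;

	for (i = 0; i < TOY_NR_NODES; i++) {
		int node = (nid + i) % TOY_NR_NODES;

		if (free_pages[node] > 0) {
			free_pages[node]--;
			return node;
		}
		/* With THISNODE, do not look beyond the requested node. */
		if (flags & TOY_GFP_THISNODE)
			break;
	}
	return TOY_ALLOC_FAILED;
}
```

When node 0 is empty and node 1 has free pages, a plain request for
node 0 is served from node 1, while the same request with
TOY_GFP_THISNODE fails.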

The misleading name has led to mistakes in the past, see for example
commits 5265047ac301 ("mm, thp: really limit transparent hugepage
allocation to local node") and b360edb43f8e ("mm, mempolicy:
migrate_to_node should only migrate to node").

Another issue with the name is that there's a family of
alloc_pages_exact*() functions where 'exact' means exact size (instead
of page order), which leads to more confusion.

To prevent further mistakes, this patch effectively renames
alloc_pages_exact_node() to __alloc_pages_node() to better convey that
it's an optimized variant of alloc_pages_node() not intended for general
usage.  Both functions get described in comments.

It has also been considered to provide a real convenience function for
allocations restricted to a node, but the majority opinion seems to be that
__GFP_THISNODE already provides that functionality and we shouldn't
duplicate the API needlessly.  The number of users would be small
anyway.

Existing callers of alloc_pages_exact_node() are simply converted to
call __alloc_pages_node(), with the exception of sba_alloc_coherent()
which open-codes the check for NUMA_NO_NODE, so it is converted to use
alloc_pages_node() instead.  This means it no longer performs some
VM_BUG_ON checks, and since the current check for nid in
alloc_pages_node() uses a 'nid < 0' comparison (which includes
NUMA_NO_NODE), it may hide wrong values which would previously have been
exposed.
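
A small sketch of why a 'nid < 0' comparison can hide wrong values. The
lenient function models the check as described above; the strict one is
only a guess at what a restricted check could look like, not the code of
the follow-up patch. The constants and function names are hypothetical.

```c
#include <stdbool.h>

#define TOY_NUMA_NO_NODE (-1)
#define TOY_CURRENT_NODE 2 /* pretend the caller runs on node 2 */

/* Lenient: every negative nid silently becomes the current node. */
static int toy_resolve_nid_lenient(int nid)
{
	if (nid < 0)
		return TOY_CURRENT_NODE;
	return nid;
}

/*
 * Strict: only the NUMA_NO_NODE sentinel is translated; any other
 * negative value is reported as invalid through *valid.
 */
static int toy_resolve_nid_strict(int nid, bool *valid)
{
	*valid = true;
	if (nid == TOY_NUMA_NO_NODE)
		return TOY_CURRENT_NODE;
	if (nid < 0)
		*valid = false;
	return nid;
}
```

A caller passing a garbage value such as -7 gets node 2 from the lenient
check and never notices; the strict check flags the value as invalid.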

Both differences will be rectified by the next patch.

To sum up, this patch makes no functional changes, except temporarily
hiding potentially buggy callers.  Restricting the checks in
alloc_pages_node() is left for the next patch which can in turn expose
more existing buggy callers.

Signed-off-by: Vlastimil Babka 
Acked-by: Johannes Weiner 
Acked-by: Robin Holt 
Acked-by: Michal Hocko 
Acked-by: Christoph Lameter 
Acked-by: Michael Ellerman 
Cc: Mel Gorman 
Cc: David Rientjes 
Cc: Greg Thelen 
Cc: Aneesh Kumar K.V 
Cc: Pekka Enberg 
Cc: Joonsoo Kim 
Cc: Naoya Horiguchi 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Arnd Bergmann 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Gleb Natapov 
Cc: Paolo Bonzini 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Cliff Whickman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: make set_recommended_min_free_kbytes() return void

2015-09-08T22:35:28+00:00

This makes set_recommended_min_free_kbytes() have a return type of void as
it cannot fail.

Signed-off-by: Nicholas Krause 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

dax: don't use set_huge_zero_page()

2015-09-08T22:35:28+00:00

This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Matthew Wilcox 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

thp: fix zap_huge_pmd() for DAX

2015-09-08T22:35:28+00:00

The original DAX code assumed that pgtable_t was a pointer, which isn't
true on all architectures.  Restructure the code to not rely on that
assumption.

[willy@linux.intel.com: further fixes integrated into this patch]
Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Matthew Wilcox 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

thp: decrement refcount on huge zero page if it is split

2015-09-08T22:35:28+00:00

The DAX code neglected to put the refcount on the huge zero page.
Also we must notify on splits.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Matthew Wilcox 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

thp: change insert_pfn's return type to void

2015-09-08T22:35:28+00:00

It would make more sense to have all the return values from
vmf_insert_pfn_pmd() encoded in one place instead of having to follow
the convention into insert_pfn().  Suggested by Jeff Moyer.

Signed-off-by: Matthew Wilcox 
Cc: Jeff Moyer 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: add vmf_insert_pfn_pmd()

2015-09-08T22:35:28+00:00

Similar to vm_insert_pfn(), but for PMDs rather than PTEs.  The 'vmf_'
prefix instead of 'vm_' prefix is intended to indicate that it returns a
VMF_ value rather than an errno (which would only have to be converted
into a VMF_ value anyway).
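
For context, the conversion that callers would otherwise have to repeat
looks roughly like the sketch below. The mapping shown is an assumption
based on a common convention in fault handlers, not something stated in
this message, and the TOY_VM_FAULT_* values are illustrative stand-ins
rather than the kernel's definitions.

```c
#include <errno.h>

/* Illustrative stand-ins for the kernel's VM_FAULT_* codes. */
#define TOY_VM_FAULT_OOM    0x0001
#define TOY_VM_FAULT_SIGBUS 0x0002
#define TOY_VM_FAULT_NOPAGE 0x0100

/* Convert an errno-style result into a fault-handler return code. */
static int toy_errno_to_vmf(int err)
{
	switch (err) {
	case 0:
	case -EBUSY: /* mapping already installed, e.g. by a racing fault */
		return TOY_VM_FAULT_NOPAGE;
	case -ENOMEM:
		return TOY_VM_FAULT_OOM;
	default:
		return TOY_VM_FAULT_SIGBUS;
	}
}
```

Returning the fault code directly from the insertion helper removes the
need for each caller to carry a switch like this one.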

Signed-off-by: Matthew Wilcox 
Cc: Hillf Danton 
Cc: "Kirill A. Shutemov" 
Cc: Theodore Ts'o 
Cc: Jan Kara 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds