linux.git/mm/memory-failure.c, branch v3.17

hwpoison: fix race with changing page during offlining

2014-08-07T01:01:19+00:00

When a hwpoison page is locked it could change state due to parallel
modifications.  The original compound page can be torn down and then
this 4k page becomes part of a differently-size compound page is is a
standalone regular page.

Check after the lock if the page is still the same compound page.

We could go back, grab the new head page and try again but it should be
quite rare, so I thought this was safest.  A retry loop would be more
difficult to test and may have more side effects.

The hwpoison code by design only tries to handle cases that are
reasonably common in workloads, as visible in page-flags.

I'm not really that concerned about handling this (likely rare case),
just not crashing on it.

Signed-off-by: Andi Kleen 
Acked-by: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hwpoison: call action_result() in failure path of hwpoison_user_mappings()

2014-07-31T00:16:13+00:00

hwpoison_user_mappings() could fail for various reasons, so printk()s to
print out the reasons should be done in each failure check inside
hwpoison_user_mappings().

And currently we don't call action_result() when hwpoison_user_mappings()
fails, which is not consistent with other exit points of memory error
handler.  So this patch fixes these messaging problems.

Signed-off-by: Naoya Horiguchi 
Cc: Andi Kleen 
Cc: Chen Yucong 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hwpoison: fix hugetlbfs/thp precheck in hwpoison_user_mappings()

2014-07-31T00:16:13+00:00

A recent fix from Chen Yucong, commit 0bc1f8b0682c ("hwpoison: fix the
handling path of the victimized page frame that belong to non-LRU")
rejects going into unmapping operation for hugetlbfs/thp pages, which
results in failing error containing on such pages.  This patch fixes it.

With this patch, hwpoison functional tests in mce-test testsuite pass.

Signed-off-by: Naoya Horiguchi 
Cc: Andi Kleen 
Cc: Chen Yucong 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/rmap.c: fix pgoff calculation to handle hugepage correctly

2014-07-23T22:10:54+00:00

I triggered VM_BUG_ON() in vma_address() when I tried to migrate an
anonymous hugepage with mbind() in the kernel v3.16-rc3.  This is
because pgoff's calculation in rmap_walk_anon() fails to consider
compound_order() only to have an incorrect value.

This patch introduces page_to_pgoff(), which gets the page's offset in
PAGE_CACHE_SIZE.

Kirill pointed out that page cache tree should natively handle
hugepages, and in order to make hugetlbfs fit it, page->index of
hugetlbfs page should be in PAGE_CACHE_SIZE.  This is beyond this patch,
but page_to_pgoff() contains the point to be fixed in a single function.

Signed-off-by: Naoya Horiguchi 
Acked-by: Kirill A. Shutemov 
Cc: Joonsoo Kim 
Cc: Hugh Dickins 
Cc: Rik van Riel 
Cc: Hillf Danton 
Cc: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hwpoison: fix the handling path of the victimized page frame that belong to non-LRU

2014-07-03T16:21:54+00:00

Until now, the kernel has the same policy to handle victimized page
frames that belong to kernel-space(reserved/slab-subsystem) or
non-LRU(unknown page state).  In other word, the result of handling
either of these victimized page frames is (IGNORED | FAILED), and the
return value of memory_failure() is -EBUSY.

This patch is to avoid that memory_failure() returns very soon due to
the "true" value of (!PageLRU(p)), and it also ensures that
action_result() can report more precise information("reserved kernel",
"kernel slab", and "unknown page state") instead of "non LRU",
especially for memory errors which are detected by memory-scrubbing.

Andi said:

: While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
:
: soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
: page:ffffea000015b400 count:3 mapcount:2097169 mapping:          (null) index:0xffff8800056d7000
: page flags: 0x3ffff800004081(locked|slab|head)
: ------------[ cut here ]------------
: kernel BUG at mm/rmap.c:1495!
:
: I think what happened is that a LRU page turned into a slab page in
: parallel with offlining.  memory_failure initially tests for this case,
: but doesn't retest later after the page has been locked.
:
: ...
:
: I ran this patch in a loop over night with some stress plus
: the mcelog test suite running in a loop. I cannot guarantee it hit it,
: but it should have given it a good beating.
:
: The kernel survived with no messages, although the mcelog test suite
: got killed at some point because it couldn't fork anymore. Probably
: some unrelated problem.
:
: So the patch is ok for me for .16.

Signed-off-by: Chen Yucong 
Acked-by: Naoya Horiguchi 
Reported-by: Andi Kleen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO)

2014-06-04T23:54:13+00:00

Currently memory error handler handles action optional errors in the
deferred manner by default.  And if a recovery aware application wants
to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
However, such signal can be sent only to the main thread, so it's
problematic if the application wants to have a dedicated thread to
handler such signals.

So this patch adds dedicated thread support to memory error handler.  We
have PF_MCE_EARLY flags for each thread separately, so with this patch
AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
thread.  If you want to implement a dedicated thread, you call prctl()
to set PF_MCE_EARLY on the thread.

Memory error handler collects processes to be killed, so this patch lets
it check PF_MCE_EARLY flag on each thread in the collecting routines.

No behavioral change for all non-early kill cases.

Tony said:

: The old behavior was crazy - someone with a multithreaded process might
: well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
: that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
: that thread wasn't the main thread for the process.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi 
Reviewed-by: Tony Luck 
Cc: Kamil Iskra 
Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: Chen Gong 
Cc: 	[3.2+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED

2014-06-04T23:54:13+00:00

When Linux sees an "action optional" machine check (where h/w has reported
an error that is not in the current execution path) we generally do not
want to signal a process, since most processes do not have a SIGBUS
handler - we'd just prematurely terminate the process for a problem that
they might never actually see.

task_early_kill() decides whether to consider a process - and it checks
whether this specific process has been marked for early signals with
"prctl", or if the system administrator has requested early signals for
all processes using /proc/sys/vm/memory_failure_early_kill.

But for MF_ACTION_REQUIRED case we must not defer.  The error is in the
execution path of the current thread so we must send the SIGBUS
immediatley.

Fix by passing a flag argument through collect_procs*() to
task_early_kill() so it knows whether we can defer or must take action.

Signed-off-by: Tony Luck 
Signed-off-by: Naoya Horiguchi 
Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: Chen Gong 
Cc: 	[3.2+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory-failure.c-failure: send right signal code to correct thread

2014-06-04T23:54:13+00:00

When a thread in a multi-threaded application hits a machine check because
of an uncorrectable error in memory - we want to send the SIGBUS with
si.si_code = BUS_MCEERR_AR to that thread.  Currently we fail to do that
if the active thread is not the primary thread in the process.
collect_procs() just finds primary threads and this test:

	if ((flags & MF_ACTION_REQUIRED) && t == current) {

will see that the thread we found isn't the current thread and so send a
si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
thread at this time).

We can fix this by checking whether "current" shares the same mm with the
process that collect_procs() said owned the page.  If so, we send the
SIGBUS to current (with code BUS_MCEERR_AR).

Signed-off-by: Tony Luck 
Signed-off-by: Naoya Horiguchi 
Reported-by: Otto Bruggeman 
Cc: Andi Kleen 
Cc: Borislav Petkov 
Cc: Chen Gong 
Cc: 	[3.2+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory-failure.c: move comment

2014-06-04T23:54:10+00:00

The comment about pages under writeback is far from the relevant code, so
let's move it to the right place.

Signed-off-by: Naoya Horiguchi 
Cc: Andi Kleen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, migration: add destination page freeing callback

2014-06-04T23:54:06+00:00

Memory migration uses a callback defined by the caller to determine how to
allocate destination pages.  When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages.  If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails.  If the caller passes NULL then
freeing back to the system will be handled as usual.  This patch
introduces no functional change.

Signed-off-by: David Rientjes 
Reviewed-by: Naoya Horiguchi 
Acked-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Greg Thelen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds