linux-stable.git/mm, branch v3.2.88

mm/huge_memory.c: fix up "mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp" backport

2017-04-04T21:18:32+00:00

This is a stable follow up fix for an incorrect backport. The issue is
not present in the upstream kernel.

Miroslav has noticed the following splat when testing my 3.2 forward
port of 8310d48b125d ("mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for
thp") to 3.12:

BUG: Bad page state in process a.out  pfn:26400
page:ffffea000085e000 count:0 mapcount:1 mapping:          (null) index:0x7f049d600
page flags: 0x1fffff80108018(uptodate|dirty|head|swapbacked)
page dumped because: nonzero mapcount
[iii]
CPU: 2 PID: 5926 Comm: a.out Tainted: G            E    3.12.61-0-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
 0000000000000000 ffffffff81515830 ffffea000085e000 ffffffff81800ad7
 ffffffff815118a5 ffffea000085e000 0000000000000000 000fffff80000000
 ffffffff81140f18 fff000007c000000 ffffea000085e000 0000000000000009
Call Trace:
 [] dump_trace+0x7d/0x2d0
 [] show_stack_log_lvl+0x94/0x170
 [] show_stack+0x21/0x50
 [] dump_stack+0x5d/0x78
 [] bad_page.part.67+0xe8/0x102
 [] free_pages_prepare+0x198/0x1b0
 [] __free_pages_ok+0x15/0xd0
 [] __access_remote_vm+0x7c/0x1e0
 [] mem_rw.isra.13+0x14b/0x1a0
 [] vfs_write+0xb8/0x1e0
 [] SyS_pwrite64+0x6b/0xa0
 [] system_call_fastpath+0x16/0x1b
 [<00007f049da18573>] 0x7f049da18572

The problem is that the original 3.2 backport didn't return NULL page on
the FOLL_COW page and so the page got reused.

Reported-and-tested-by: Miroslav Beneš 
Signed-off-by: Michal Hocko 
Signed-off-by: Ben Hutchings

mm, fs: check for fatal signals in do_generic_file_read()

2017-03-16T02:18:47+00:00

commit 5abf186a30a89d5b9c18a6bf93a2c192c9fd52f6 upstream.

do_generic_file_read() can be told to perform a large request from
userspace.  If the system is under OOM and the reading task is the OOM
victim then it has an access to memory reserves and finishing the full
request can lead to the full memory depletion which is dangerous.  Make
sure we rather go with a short read and allow the killed task to
terminate.

Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reviewed-by: Christoph Hellwig 
Cc: Tetsuo Handa 
Cc: Al Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp

2017-03-16T02:18:46+00:00

commit 8310d48b125d19fcd9521d83b8293e63eb1646aa upstream.

In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from
__get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE
after a COW was resolved to setting the (newly introduced) FOLL_COW
instead.  Simultaneously, the check in gup.c was updated to still allow
writes with FOLL_FORCE set if FOLL_COW had also been set.

However, a similar check in huge_memory.c was forgotten.  As a result,
remote memory writes to ro regions of memory backed by transparent huge
pages cause an infinite loop in the kernel (handle_mm_fault sets
FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails
out immidiately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is
true.

While in this state the process is stil SIGKILLable, but little else
works (e.g.  no ptrace attach, no other signals).  This is easily
reproduced with the following code (assuming thp are set to always):

    #include 
    #include 
    #include 
    #include 
    #include 
    #include 
    #include 
    #include 
    #include 
    #include 

    #define TEST_SIZE 5 * 1024 * 1024

    int main(void) {
      int status;
      pid_t child;
      int fd = open("/proc/self/mem", O_RDWR);
      void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
                        MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
      assert(addr != MAP_FAILED);
      pid_t parent_pid = getpid();
      if ((child = fork()) == 0) {
        void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
        assert(addr2 != MAP_FAILED);
        memset(addr2, 'a', TEST_SIZE);
        pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
        return 0;
      }
      assert(child == waitpid(child, &status, 0));
      assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
      return 0;
    }

Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously
to the update in gup.c in the original commit.  The same pattern exists
in follow_devmap_pmd.  However, we should not be able to reach that
check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we
ever do.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com
Signed-off-by: Keno Fischer 
Acked-by: Kirill A. Shutemov 
Cc: Greg Thelen 
Cc: Nicholas Piggin 
Cc: Willy Tarreau 
Cc: Oleg Nesterov 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2:
 - Drop change to follow_devmap_pmd()
 - pmd_dirty() is not available; check the page flags as in
   can_follow_write_pte()
 - Adjust context]
Signed-off-by: Ben Hutchings

swapfile: fix memory corruption via malformed swapfile

2017-02-23T03:50:59+00:00

commit dd111be69114cc867f8e826284559bfbc1c40e37 upstream.

When root activates a swap partition whose header has the wrong
endianness, nr_badpages elements of badpages are swabbed before
nr_badpages has been checked, leading to a buffer overrun of up to 8GB.

This normally is not a security issue because it can only be exploited
by root (more specifically, a process with CAP_SYS_ADMIN or the ability
to modify a swap file/partition), and such a process can already e.g.
modify swapped-out memory of any other userspace process on the system.

Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
Signed-off-by: Jann Horn 
Acked-by: Kees Cook 
Acked-by: Jerome Marchand 
Acked-by: Johannes Weiner 
Cc: "Kirill A. Shutemov" 
Cc: Vlastimil Babka 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

fs: Give dentry to inode_change_ok() instead of inode

2016-11-20T01:01:43+00:00

commit 31051c85b5e2aaaf6315f74c72a732673632a905 upstream.

inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
[bwh: Backported to 3.2:
 - Drop changes to f2fs, lustre, orangefs, overlayfs
 - Adjust filenames, context
 - In nfsd, pass dentry to nfsd_sanitize_attrs()
 - In xfs, pass dentry to xfs_change_file_space(), xfs_set_mode(),
   xfs_setattr_nonsize(), and xfs_setattr_size()
 - Update ext3 as well
 - Mark pohmelfs as BROKEN; it's long dead upstream]
Signed-off-by: Ben Hutchings

mm,ksm: fix endless looping in allocating memory when ksm enable

2016-11-20T01:01:43+00:00

commit 5b398e416e880159fe55eefd93c6588fa072cd66 upstream.

I hit the following hung task when runing a OOM LTP test case with 4.1
kernel.

Call trace:
[] __switch_to+0x74/0x8c
[] __schedule+0x23c/0x7bc
[] schedule+0x3c/0x94
[] rwsem_down_write_failed+0x214/0x350
[] down_write+0x64/0x80
[] __ksm_exit+0x90/0x19c
[] mmput+0x118/0x11c
[] do_exit+0x2dc/0xa74
[] do_group_exit+0x4c/0xe4
[] get_signal+0x444/0x5e0
[] do_signal+0x1d8/0x450
[] do_notify_resume+0x70/0x78

The oom victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read which loops in the page
allocator

ksm_do_scan
	scan_get_next_rmap_item
		down_read
		get_next_rmap_item
			alloc_rmap_item   #ksmd will loop permanently.

There is no way forward because the oom victim cannot release any memory
in 4.1 based kernel.  Since 4.6 we have the oom reaper which would solve
this problem because it would release the memory asynchronously.
Nevertheless we can relax alloc_rmap_item requirements and use
__GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
would just retry later after the lock got dropped.

Such a patch would be also easy to backport to older stable kernels which
do not have oom_reaper.

While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed
by the allocation failure.

Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang 
Suggested-by: Hugh Dickins 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
Acked-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

mm/hugetlb: avoid soft lockup in set_max_huge_pages()

2016-11-20T01:01:30+00:00

commit 649920c6ab93429b94bc7c1aa7c0e8395351be32 upstream.

In powerpc servers with large memory(32TB), we watched several soft
lockups for hugepage under stress tests.

The call traces are as follows:
1.
get_page_from_freelist+0x2d8/0xd50
__alloc_pages_nodemask+0x180/0xc20
alloc_fresh_huge_page+0xb0/0x190
set_max_huge_pages+0x164/0x3b0

2.
prep_new_huge_page+0x5c/0x100
alloc_fresh_huge_page+0xc8/0x190
set_max_huge_pages+0x164/0x3b0

This patch fixes such soft lockups.  It is safe to call cond_resched()
there because it is out of spin_lock/unlock section.

Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com
Signed-off-by: Jia He 
Reviewed-by: Naoya Horiguchi 
Acked-by: Michal Hocko 
Acked-by: Dave Hansen 
Cc: Mike Kravetz 
Cc: "Kirill A. Shutemov" 
Cc: Paul Gortmaker 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

mm, gup: close FOLL MAP_PRIVATE race

2016-10-20T22:41:00+00:00

commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.

faultin_page drops FOLL_WRITE after the page fault handler did the CoW
and then we retry follow_page_mask to get our CoWed page. This is racy,
however because the page might have been unmapped by that time and so
we would have to do a page fault again, this time without CoW. This
would cause the page cache corruption for FOLL_FORCE on MAP_PRIVATE
read only mappings with obvious consequences.

This is an ancient bug that was actually already fixed once by Linus
eleven years ago in commit 4ceb5db9757a ("Fix get_user_pages() race
for write access") but that was then undone due to problems on s390
by commit f33ea7f404e5 ("fix get_user_pages bug") because s390 didn't
have proper dirty pte tracking until abf09bed3cce ("s390/mm: implement
software dirty bits"). This wasn't a problem at the time as pointed out
by Hugh Dickins because madvise relied on mmap_sem for write up until
0a27a14a6292 ("mm: madvise avoid exclusive mmap_sem") but since then we
can race with madvise which can unmap the fresh COWed page or with KSM
and corrupt the content of the shared page.

This patch is based on the Linus' approach to not clear FOLL_WRITE after
the CoW page fault (aka VM_FAULT_WRITE) but instead introduces FOLL_COW
to note this fact. The flag is then rechecked during follow_pfn_pte to
enforce the page fault again if we do not see the CoWed page. Linus was
suggesting to check pte_dirty again as s390 is OK now. But that would
make backporting to some old kernels harder. So instead let's just make
sure that vm_normal_page sees a pure anonymous page.

This would guarantee we are seeing a real CoW page. Introduce
can_follow_write_pte which checks both pte_write and falls back to
PageAnon on forced write faults which passed CoW already. Thanks to Hugh
to point out that a special care has to be taken for KSM pages because
our COWed page might have been merged with a KSM one and keep its
PageAnon flag.

Fixes: 0a27a14a6292 ("mm: madvise avoid exclusive mmap_sem")
Reported-by: Phil "not Paul" Oester 
Disclosed-by: Andy Lutomirski 
Signed-off-by: Linus Torvalds 
Signed-off-by: Michal Hocko 
[bwh: Backported to 3.2:
 - Adjust filename, context, indentation
 - The 'no_page' exit path in follow_page() is different, so open-code the
   cleanup
 - Delete a now-unused label]
Signed-off-by: Ben Hutchings

mm: Export migrate_page_move_mapping and migrate_page_copy

2016-08-22T21:37:15+00:00

commit 1118dce773d84f39ebd51a9fe7261f9169cb056e upstream.

Export these symbols such that UBIFS can implement
->migratepage.

Signed-off-by: Richard Weinberger 
Acked-by: Christoph Hellwig 
[bwh: Backported to 3.2: also change migrate_page_move_mapping() from
 static to extern, done as part of an earlier commit upstream]
Signed-off-by: Ben Hutchings

mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check

2016-06-15T20:28:13+00:00

commit 3486b85a29c1741db99d0c522211c82d2b7a56d0 upstream.

Khugepaged detects own VMAs by checking vm_file and vm_ops but this way
it cannot distinguish private /dev/zero mappings from other special
mappings like /dev/hpet which has no vm_ops and popultes PTEs in mmap.

This fixes false-positive VM_BUG_ON and prevents installing THP where
they are not expected.

Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com
Fixes: 78f11a255749 ("mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups")
Signed-off-by: Konstantin Khlebnikov 
Reported-by: Dmitry Vyukov 
Acked-by: Vlastimil Babka 
Acked-by: Kirill A. Shutemov 
Cc: Dmitry Vyukov 
Cc: Andrea Arcangeli 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2:
 - The assertions use VM_BUG_ON() and also check is_linear_pfn_mapping();
   keep that check
 - Adjust context]
Signed-off-by: Ben Hutchings