linux-stable.git/mm, branch v4.1.41

mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp

2017-06-13T13:29:21+00:00

[ Upstream commit 8310d48b125d19fcd9521d83b8293e63eb1646aa ]

In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from
__get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE
after a COW was resolved to setting the (newly introduced) FOLL_COW
instead.  Simultaneously, the check in gup.c was updated to still allow
writes with FOLL_FORCE set if FOLL_COW had also been set.

However, a similar check in huge_memory.c was forgotten.  As a result,
remote memory writes to ro regions of memory backed by transparent huge
pages cause an infinite loop in the kernel (handle_mm_fault sets
FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails
out immediately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is
true).

While in this state the process is still SIGKILLable, but little else
works (e.g.  no ptrace attach, no other signals).  This is easily
reproduced with the following code (assuming thp are set to always):

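    #include <assert.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define TEST_SIZE (5 * 1024 * 1024)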

    int main(void) {
      int status;
      pid_t child;
      int fd = open("/proc/self/mem", O_RDWR);
      void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
                        MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
      assert(addr != MAP_FAILED);
      pid_t parent_pid = getpid();
      if ((child = fork()) == 0) {
        void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
        assert(addr2 != MAP_FAILED);
        memset(addr2, 'a', TEST_SIZE);
        pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
        return 0;
      }
      assert(child == waitpid(child, &status, 0));
      assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
      return 0;
    }

Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously
to the update in gup.c in the original commit.  The same pattern exists
in follow_devmap_pmd.  However, we should not be able to reach that
check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we
ever do.
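
The changed check can be sketched in user space as follows.  This is only
a model, not the kernel code: struct model_pmd, the *_refused() helpers
and the flag values are made up for illustration, and the dirty-bit
condition is an assumption based on the analogous check in gup.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the kernel's real FOLL_* constants may differ. */
#define FOLL_WRITE 0x01u
#define FOLL_FORCE 0x10u
#define FOLL_COW   0x4000u

/* Hypothetical stand-in for a huge pmd entry: only the bits we need. */
struct model_pmd {
    bool writable;
    bool dirty;
};

/* Old check: any write to a non-writable pmd is refused, so the retry
 * after the COW fault can never succeed. */
static bool old_write_refused(struct model_pmd pmd, unsigned int flags)
{
    return (flags & FOLL_WRITE) && !pmd.writable;
}

/* New check: a forced write is also allowed once COW has been resolved
 * (FOLL_COW set) and the pmd is dirty. */
static bool can_follow_write_pmd(struct model_pmd pmd, unsigned int flags)
{
    return pmd.writable ||
           ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd.dirty);
}

static bool new_write_refused(struct model_pmd pmd, unsigned int flags)
{
    return (flags & FOLL_WRITE) && !can_follow_write_pmd(pmd, flags);
}
```

With FOLL_WRITE | FOLL_FORCE | FOLL_COW on a read-only, dirty pmd, the
old check still refuses the write while the new one lets it through; a
plain FOLL_WRITE is refused by both.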

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com
Signed-off-by: Keno Fischer 
Acked-by: Kirill A. Shutemov 
Cc: Greg Thelen 
Cc: Nicholas Piggin 
Cc: Willy Tarreau 
Cc: Oleg Nesterov 
Cc: Kees Cook 
Cc: Andy Lutomirski 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Sasha Levin

mm/mempolicy.c: fix error handling in set_mempolicy and mbind.

2017-06-13T13:29:16+00:00

[ Upstream commit cf01fb9985e8deb25ccf0ea54d916b8871ae0e62 ]

In the case that compat_get_bitmap fails we do not want to copy the
bitmap to the user as it will contain uninitialized stack data and leak
sensitive data.
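
A minimal user-space sketch of the corrected control flow, assuming
made-up helpers (model_get_bitmap() stands in for compat_get_bitmap(),
model_set_policy() for the syscall wrapper); the real functions and
their signatures differ:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MODEL_WORDS 4

/* Hypothetical stand-in for compat_get_bitmap(): fails when src is NULL
 * and leaves dst untouched, as a faulting user copy would. */
static int model_get_bitmap(unsigned long *dst, const unsigned long *src,
                            size_t n)
{
    if (!src)
        return -EFAULT;
    memcpy(dst, src, n * sizeof(*dst));
    return 0;
}

/* Fixed flow: return before anything is copied back to the caller. */
static int model_set_policy(unsigned long *user_out,
                            const unsigned long *user_in, size_t n)
{
    unsigned long bm[MODEL_WORDS]; /* on-stack, not initialized */

    if (n > MODEL_WORDS)
        return -EINVAL;
    if (model_get_bitmap(bm, user_in, n))
        return -EFAULT; /* bm is never exposed on failure */
    memcpy(user_out, bm, n * sizeof(*bm));
    return 0;
}
```

The point of the early return is that the stack buffer is only copied
out after it has been fully written by a successful read.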

Signed-off-by: Chris Salls 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

mlock: fix mlock count can not decrease in race condition

2017-06-08T10:42:00+00:00

[ Upstream commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc ]

Kefeng reported that when running the following test, the mlock count in
meminfo will increase permanently:

 [1] testcase
 linux:~ # cat test_mlockal
 grep Mlocked /proc/meminfo
  for j in `seq 0 10`
  do
 	for i in `seq 4 15`
 	do
 		./p_mlockall >> log &
 	done
 	sleep 0.2
 done
 # wait for the mlock counter to decrease; 5s may not be enough
 sleep 5
 grep Mlocked /proc/meminfo

 linux:~ # cat p_mlockall.c
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/mman.h>

 #define SPACE_LEN	4096

 int main(int argc, char ** argv)
 {
	 	int ret;
	 	void *adr = malloc(SPACE_LEN);
	 	if (!adr)
	 		return -1;

	 	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
	 	printf("mlockall ret = %d\n", ret);

	 	ret = munlockall();
	 	printf("munlockall ret = %d\n", ret);

	 	free(adr);
	 	return 0;
	 }

In __munlock_pagevec() we should decrement NR_MLOCK for each page where
we clear the PageMlocked flag.  Commit 1ebb7cc6a583 ("mm: munlock: batch
NR_MLOCK zone state updates") has introduced a bug where we don't
decrement NR_MLOCK for pages where we clear the flag, but fail to
isolate them from the lru list (e.g.  when the pages are on some other
cpu's percpu pagevec).  Since PageMlocked stays cleared, the NR_MLOCK
accounting gets permanently disrupted by this.

Fix it by counting the number of pages whose PageMlocked flag is cleared.
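
The accounting rule can be modelled in user space roughly like this.
struct model_page and model_munlock_pagevec() are hypothetical names for
illustration only; the kernel operates on struct page and a pagevec:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page {
    bool mlocked; /* stands in for the PageMlocked flag */
    bool on_lru;  /* isolation from the LRU succeeds only if true */
};

/* Returns the delta that should be applied to the NR_MLOCK counter. */
static long model_munlock_pagevec(struct model_page *pages, size_t nr)
{
    long delta_munlocked = 0;

    for (size_t i = 0; i < nr; i++) {
        if (!pages[i].mlocked)
            continue;
        pages[i].mlocked = false; /* flag cleared ... */
        delta_munlocked--;        /* ... so always account for it */
        if (pages[i].on_lru)
            pages[i].on_lru = false; /* isolated for further work */
        /* else: isolation failed, but the flag stays cleared */
    }
    return delta_munlocked;
}
```

Counting at the point where the flag is cleared keeps the counter
correct whether or not the later isolation step succeeds.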

Fixes: 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates")
Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie 
Reported-by: Kefeng Wang 
Tested-by: Kefeng Wang 
Cc: Vlastimil Babka 
Cc: Joern Engel 
Cc: Mel Gorman 
Cc: Michel Lespinasse 
Cc: Hugh Dickins 
Cc: Rik van Riel 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Xishi Qiu 
Cc: zhongjiang 
Cc: Hanjun Guo 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling

2017-06-08T10:42:00+00:00

[ Upstream commit ead07f6a867b5b1b41cf703735e8b39094987a7d ]

memory_failure() can run in 2 different modes (specified by
MF_COUNT_INCREASED) from the page refcount perspective.  When
MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
has taken a refcount on the target page.  If it is cleared,
memory_failure() takes the refcount on its own.

In the current code, however, refcounting is done differently in each
caller.  For example, madvise_hwpoison() uses get_user_pages_fast() and
hwpoison_inject() uses get_page_unless_zero().  This inconsistent
refcounting causes refcount failures, especially for thp tail pages.
Typical user-visible effects are memory leaks or
VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().

To fix this refcounting issue, this patch introduces get_hwpoison_page()
to handle thp tail pages in the same manner for each caller of hwpoison
code.

memory_failure() might fail to split a thp, and in that case it returns
without completing page isolation.  This is not good because PageHWPoison
on the thp is still set and there's no easy way to unpoison such thps.  So
this patch tries to roll back any action on the thp in the "non anonymous
thp" case and the "thp split failed" case, expecting that an MCE(SRAR)
generated by a later access will properly free such thps.

[akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
Signed-off-by: Naoya Horiguchi 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Kirill A. Shutemov" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Sasha Levin

mm/memory-failure: split thp earlier in memory error handling

2017-06-08T10:42:00+00:00

[ Upstream commit 415c64c1453aa2bbcc7e30a38f8894d0894cb8ab ]

memory_failure() doesn't handle thp itself at this time and needs to split
it before doing isolation.  Currently thp is split in the middle of
hwpoison_user_mappings(), but there are corner cases where memory_failure()
wrongly tries to handle thp without splitting.

1) "non anonymous" thp, which is not a normal operating mode of thp,
   but a memory error could hit a thp before anon_vma is initialized.  In
   such a case, split_huge_page() fails and me_huge_page() (intended for
   hugetlb) is called for thp, which triggers BUG_ON in page_hstate().

2) !PageLRU case, where hwpoison_user_mappings() returns with
   SWAP_SUCCESS and the result is the same as case 1.

memory_failure() can't avoid splitting, so let's split it earlier, which
also reduces the code that has to be prepared for both normal pages and
thp.

Signed-off-by: Naoya Horiguchi 
Cc: Andi Kleen 
Cc: Tony Luck 
Cc: "Kirill A. Shutemov" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

slub/memcg: cure the brainless abuse of sysfs attributes

2017-06-08T10:42:00+00:00

[ Upstream commit 478fe3037b2278d276d4cd9cd0ab06c4cb2e9b32 ]

memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
to propagate settings from the root kmem_cache to a newly created
kmem_cache.  It does that with:

     attr->show(root, buf);
     attr->store(new, buf, strlen(buf));

Aside from being lazy and absurd hackery, this is broken because it does
not check the return value of the show() function.

Some of the show() functions return 0 w/o touching the buffer.  That
means in such a case the store function is called with the stale content
of the previous show().  That causes nonsense like invoking
kmem_cache_shrink() on a newly created kmem_cache.  In the worst case it
would cause handing in an uninitialized buffer.

This should be rewritten properly by adding a propagate() callback to
those slub_attributes which must be propagated and avoid that insane
conversion to and from ASCII, but that's too large for a hot fix.

Check at least the return value of the show() function, so calling
store() with stale content is prevented.
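
A user-space sketch of the guarded propagation, assuming a simplified
attribute type.  struct model_attr, the show_*/count_store helpers and
model_propagate() are invented for illustration; the real callbacks also
take a kmem_cache argument:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical, simplified attribute with show/store callbacks. */
struct model_attr {
    ssize_t (*show)(char *buf);
    ssize_t (*store)(const char *buf, size_t len);
};

static int store_calls; /* how often store() was invoked */

/* A show() that writes a value and reports its length. */
static ssize_t show_value(char *buf)
{
    strcpy(buf, "1\n");
    return 2;
}

/* A show() that returns 0 without touching the buffer. */
static ssize_t show_nothing(char *buf)
{
    (void)buf;
    return 0;
}

static ssize_t count_store(const char *buf, size_t len)
{
    (void)buf;
    store_calls++;
    return (ssize_t)len;
}

/* Only call store() when show() actually produced output. */
static void model_propagate(const struct model_attr *attrs, size_t n)
{
    char buf[64];

    for (size_t i = 0; i < n; i++) {
        ssize_t len = attrs[i].show(buf);

        if (len > 0)
            attrs[i].store(buf, (size_t)len);
    }
}
```

Without the len > 0 test, the second attribute would be stored with
whatever the previous show() left in the buffer.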

Steven said:
 "It can cause a deadlock with get_online_cpus() that has been uncovered
  by recent cpu hotplug and lockdep changes that Thomas and Peter have
  been doing.

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(cpu_hotplug.lock);
                                   lock(slab_mutex);
                                   lock(cpu_hotplug.lock);
      lock(slab_mutex);

     *** DEADLOCK ***"

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
Signed-off-by: Thomas Gleixner 
Reported-by: Steven Rostedt 
Acked-by: David Rientjes 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Peter Zijlstra 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: Joonsoo Kim 
Cc: Christoph Hellwig 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-05-17T19:07:44+00:00

[ Upstream commit c9d398fa237882ea07167e23bcfc5e6847066518 ]

I found the race condition which triggers the following bug when
move_pages() and soft offline are called on a single hugetlb page
concurrently.

    Soft offlining page 0x119400 at 0x700000000000
    BUG: unable to handle kernel paging request at ffffea0011943820
    IP: follow_huge_pmd+0x143/0x190
    PGD 7ffd2067
    PUD 7ffd1067
    PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
    CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P           OE   4.11.0-rc2-mm1+ #2
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    RIP: 0010:follow_huge_pmd+0x143/0x190
    RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
    RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
    RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
    RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
    R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
    R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
    FS:  00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
    Call Trace:
     follow_page_mask+0x270/0x550
     SYSC_move_pages+0x4ea/0x8f0
     SyS_move_pages+0xe/0x10
     do_syscall_64+0x67/0x180
     entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7fc976e03949
    RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
    RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
    RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
    R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
    R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
    Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
    RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
    CR2: ffffea0011943820
    ---[ end trace e4f81353a2d23232 ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled

This bug is triggered when pmd_present() returns true for non-present
hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
Using pmd_present() to determine present/non-present for hugetlb is not
correct, because pmd_present() checks multiple bits (not only
_PAGE_PRESENT) for historical reasons and it can misjudge hugetlb state.

Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi 
Acked-by: Hillf Danton 
Cc: Hugh Dickins 
Cc: Michal Hocko 
Cc: "Kirill A. Shutemov" 
Cc: Mike Kravetz 
Cc: Christian Borntraeger 
Cc: Gerald Schaefer 
Cc:         [4.0+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages

2017-05-17T19:07:00+00:00

[ Upstream commit 320661b08dd6f1746d5c7ab4eb435ec64b97cd45 ]

Update to pcpu_nr_empty_pop_pages in pcpu_alloc() is currently done
without holding pcpu_lock. This can lead to bad updates to the variable.
Add missing lock calls.
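
As a user-space analogy only (the kernel uses a spinlock, and the names
below are hypothetical), the pattern is to make every update of the
shared counter happen under the same lock:

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins for pcpu_lock and pcpu_nr_empty_pop_pages. */
static pthread_mutex_t model_pcpu_lock = PTHREAD_MUTEX_INITIALIZER;
static int model_nr_empty_pop_pages;

/* Read-modify-write of the counter, serialized by the lock. */
static void model_adjust_empty_pop_pages(int delta)
{
    pthread_mutex_lock(&model_pcpu_lock);
    model_nr_empty_pop_pages += delta;
    pthread_mutex_unlock(&model_pcpu_lock);
}
```

An unlocked `+=` is a read-modify-write sequence, so two concurrent
updaters can lose one of the updates.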

Fixes: b539b87fed37 ("percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated")
Signed-off-by: Tahsin Erdogan 
Signed-off-by: Tejun Heo 
Cc: stable@vger.kernel.org # v3.18+
Signed-off-by: Sasha Levin

mm: do not access page->mapping directly on page_endio

2017-05-17T19:06:59+00:00

[ Upstream commit dd8416c47715cf324c9a16f13273f9fda87acfed ]

With rw_page, page_endio is used for completing IO on a page and it
propagates write errors to the address space if the IO fails.  The
problem is that it accesses page->mapping directly, which might be okay
for file-backed pages but is not for anonymous pages.  Otherwise, it can
corrupt one of the fields of anon_vma under us and the system panics
randomly.

swap_writepage
  bdev_writepage
    ops->rw_page
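
A simplified user-space model of why the raw pointer must not be used.
All names are hypothetical, the low-bit tag only imitates how anonymous
mappings are marked, and the swap cache case is ignored here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAPPING_ANON 0x1u /* low-bit tag marking a non-file mapping */

struct model_address_space {
    int error;
};

struct model_page {
    uintptr_t mapping; /* tagged pointer; not always an address_space */
};

/* Return the address space only when the field really points to one. */
static struct model_address_space *
model_page_mapping(const struct model_page *page)
{
    if (page->mapping & MODEL_MAPPING_ANON)
        return NULL;
    return (struct model_address_space *)page->mapping;
}

/* Record a write error without ever writing through a tagged pointer. */
static void model_end_write_error(const struct model_page *page, int err)
{
    struct model_address_space *mapping = model_page_mapping(page);

    if (mapping)
        mapping->error = err;
}
```

Going through the accessor means an anonymous page's field is never
dereferenced as if it were an address space.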

I encountered the BUG while developing a new zram feature and it was
really hard to figure out because it caused random crashes, sometimes in
mmap_sem lockdep, sometimes in other places never related to
zram/zsmalloc, and it was not reproducible with some configurations.

Considering how subtle the bug is and that people do fast-swap tests with
brd, I think it's worth adding the stable mark.

Fixes: dd6bd0d9c7db ("swap: use bdev_read_page() / bdev_write_page()")
Signed-off-by: Minchan Kim 
Acked-by: Michal Hocko 
Cc: Matthew Wilcox 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

mm: vmpressure: fix sending wrong events on underflow

2017-05-17T19:06:59+00:00

[ Upstream commit e1587a4945408faa58d0485002c110eb2454740c ]

At the end of a window period, if the number of reclaimed pages is greater
than the number scanned, an unsigned underflow can result in a huge
pressure value and thus a critical event.  Reclaimed can go higher than
scanned because reclaimed slab pages are added to reclaimed in
shrink_node without a corresponding increment to scanned pages.

Minchan Kim mentioned that this can also happen in the case of a THP
page where the scanned is 1 and reclaimed could be 512.
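
The arithmetic can be reproduced in user space with the sketch below.
The function names are made up, and the formula is my reading of the
pressure calculation, so treat it as an assumption rather than the exact
kernel code:

```c
#include <assert.h>

/* Pressure as a percentage, without any guard.  Callers must pass
 * scanned > 0.  Underflows when reclaimed > scanned. */
static unsigned long model_pressure_unchecked(unsigned long scanned,
                                              unsigned long reclaimed)
{
    unsigned long scale = scanned + reclaimed;
    unsigned long pressure = scale - (reclaimed * scale / scanned);

    return pressure * 100 / scale;
}

/* Guarded variant: no pressure if we reclaimed at least what we scanned. */
static unsigned long model_pressure_checked(unsigned long scanned,
                                            unsigned long reclaimed)
{
    if (reclaimed >= scanned)
        return 0;
    return model_pressure_unchecked(scanned, reclaimed);
}
```

For the THP case above (scanned = 1, reclaimed = 512), scale is 513 and
reclaimed * scale / scanned is 262656, so the unsigned subtraction wraps
around to a huge value; the guarded variant reports 0 instead.  With
scanned = 100 and reclaimed = 50 both variants give 50.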

Link: http://lkml.kernel.org/r/1486641577-11685-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon 
Acked-by: Minchan Kim 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Vladimir Davydov 
Cc: Anton Vorontsov 
Cc: Shiraz Hashim 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin