linux-stable.git/mm/memory.c, branch v5.5.7

mm/memory.c: add apply_to_existing_page_range() helper

2019-12-18T04:59:59+00:00

apply_to_page_range() takes an address range, and if any parts of it are
not covered by the existing page table hierarchy, it allocates memory to
fill them in.

In some use cases, this is not what we want - we want to be able to
operate exclusively on PTEs that are already in the tables.

Add apply_to_existing_page_range() for this.  Adjust the walker
functions for apply_to_page_range to take 'create', which switches them
between the old and new modes.

This will be used in KASAN vmalloc.

[akpm@linux-foundation.org: reduce code duplication]
[akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/]
[akpm@linux-foundation.org: initialize __apply_to_page_range::err]
Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.net
Signed-off-by: Daniel Axtens 
Cc: Dmitry Vyukov 
Cc: Uladzislau Rezki (Sony) 
Cc: Alexander Potapenko 
Cc: Daniel Axtens 
Cc: Qian Cai 
Cc: Andrey Ryabinin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove __ARCH_HAS_4LEVEL_HACK and include/asm-generic/4level-fixup.h

2019-12-05T03:44:15+00:00

There are no architectures that use include/asm-generic/4level-fixup.h
therefore it can be removed along with __ARCH_HAS_4LEVEL_HACK define.

Link: http://lkml.kernel.org/r/1572938135-31886-14-git-send-email-rppt@kernel.org
Signed-off-by: Mike Rapoport 
Cc: Anatoly Pugachev 
Cc: Anton Ivanov 
Cc: Arnd Bergmann 
Cc: "David S. Miller" 
Cc: Geert Uytterhoeven 
Cc: Greentime Hu 
Cc: Greg Ungerer 
Cc: Helge Deller 
Cc: "James E.J. Bottomley" 
Cc: Jeff Dike 
Cc: "Kirill A. Shutemov" 
Cc: Mark Salter 
Cc: Matt Turner 
Cc: Michal Simek 
Cc: Peter Rosin 
Cc: Richard Weinberger 
Cc: Rolf Eike Beer 
Cc: Russell King 
Cc: Russell King 
Cc: Sam Creasey 
Cc: Vincent Chen 
Cc: Vineet Gupta 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory.c: replace is_zero_pfn with is_huge_zero_pmd for thp

2019-12-05T03:44:11+00:00

For hugely mapped thp, we use is_huge_zero_pmd() to check if it's zero
page or not.

We do fill ptes with my_zero_pfn() when we split zero thp pmd, but this
is not what we have in vm_normal_page_pmd() -- pmd_trans_huge_lock()
makes sure of it.

This is a trivial fix for /proc/pid/numa_maps, and AFAIK nobody
complains about it.

Gerald Schaefer asked:
: Maybe the description could also mention the symptom of this bug?
: I would assume that it affects anon/dirty accounting in gather_pte_stats(),
: for huge mappings, if zero page mappings are not correctly recognized.

I came across this while I was looking at the code, so I'm not aware of
any symptom.

Link: http://lkml.kernel.org/r/20191108192629.201556-1-yuzhao@google.com
Signed-off-by: Yu Zhao 
Acked-by: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Ralph Campbell 
Cc: Will Deacon 
Cc: Peter Zijlstra 
Cc: "Aneesh Kumar K . V" 
Cc: Dave Airlie 
Cc: Thomas Hellstrom 
Cc: Souptick Joarder 
Cc: Gerald Schaefer 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix typos in comments when calling __SetPageUptodate()

2019-12-01T20:59:10+00:00

There are several places emphasise the effect of __SetPageUptodate(),
while the comment seems to have a typo in two places.

Link: http://lkml.kernel.org/r/20190926023705.7226-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang 
Reviewed-by: Andrew Morton 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory.c: fix a huge pud insertion race during faulting

2019-12-01T14:29:19+00:00

A huge pud page can theoretically be faulted in racing with pmd_alloc()
in __handle_mm_fault().  That will lead to pmd_alloc() returning an
invalid pmd pointer.

Fix this by adding a pud_trans_unstable() function similar to
pmd_trans_unstable() and check whether the pud is really stable before
using the pmd pointer.

Race:
  Thread 1:             Thread 2:                 Comment
  create_huge_pud()                               Fallback - not taken.
                        create_huge_pud()         Taken.
  pmd_alloc()                                     Returns an invalid pointer.

This will result in user-visible huge page data corruption.

Note that this was caught during a code audit rather than a real
experienced problem.  It looks to me like the only implementation that
currently creates huge pud pagetable entries is dev_dax_huge_fault()
which doesn't appear to care much about private (COW) mappings or
write-tracking which is, I believe, a prerequisite for create_huge_pud()
falling back on thread 1, but not in thread 2.

Link: http://lkml.kernel.org/r/20191115115808.21181-2-thomas_os@shipmail.org
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Signed-off-by: Thomas Hellstrom 
Acked-by: Kirill A. Shutemov 
Cc: Arnd Bergmann 
Cc: Matthew Wilcox 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

rss_stat: add support to detect RSS updates of external mm

2019-12-01T14:29:18+00:00

When a process updates the RSS of a different process, the rss_stat
tracepoint appears in the context of the process doing the update.  This
can confuse userspace that the RSS of process doing the update is
updated, while in reality a different process's RSS was updated.

This issue happens in reclaim paths such as with direct reclaim or
background reclaim.

This patch adds more information to the tracepoint about whether the mm
being updated belongs to the current process's context (curr field).  We
also include a hash of the mm pointer so that the process who the mm
belongs to can be uniquely identified (mm_id field).

Also vsprintf.c is refactored a bit to allow reuse of hashing code.

[akpm@linux-foundation.org: remove unused local `str']
[joelaf@google.com: inline call to ptr_to_hashval]
  Link: http://lore.kernel.org/r/20191113153816.14b95acd@gandalf.local.home
  Link: http://lkml.kernel.org/r/20191114164622.GC233237@google.com
Link: http://lkml.kernel.org/r/20191106024452.81923-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) 
Reported-by: Ioannis Ilkos 
Acked-by: Petr Mladek 	[lib/vsprintf.c]
Cc: Tim Murray 
Cc: Michal Hocko 
Cc: Carmen Jackson 
Cc: Mayank Gupta 
Cc: Daniel Colascione 
Cc: Steven Rostedt (VMware) 
Cc: Minchan Kim 
Cc: "Aneesh Kumar K.V" 
Cc: Dan Williams 
Cc: Jerome Glisse 
Cc: Matthew Wilcox 
Cc: Ralph Campbell 
Cc: Vlastimil Babka 
Cc: Steven Rostedt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: emit tracepoint when RSS changes

2019-12-01T14:29:18+00:00

Useful to track how RSS is changing per TGID to detect spikes in RSS and
memory hogs.  Several Android teams have been using this patch in
various kernel trees for half a year now.  Many reported to me it is
really useful so I'm posting it upstream.

Initial patch developed by Tim Murray.  Changes I made from original
patch: o Prevent any additional space consumed by mm_struct.

Regarding the fact that the RSS may change too often thus flooding the
traces - note that, there is some "hysterisis" with this already.  That
is - We update the counter only if we receive 64 page faults due to
SPLIT_RSS_ACCOUNTING.  However, during zapping or copying of pte range,
the RSS is updated immediately which can become noisy/flooding.  In a
previous discussion, we agreed that BPF or ftrace can be used to rate
limit the signal if this becomes an issue.

Also note that I added wrappers to trace_rss_stat to prevent compiler
errors where linux/mm.h is included from tracing code, causing errors
such as:

    CC      kernel/trace/power-traces.o
  In file included from ./include/trace/define_trace.h:102,
                   from ./include/trace/events/kmem.h:342,
                   from ./include/linux/mm.h:31,
                   from ./include/linux/ring_buffer.h:5,
                   from ./include/linux/trace_events.h:6,
                   from ./include/trace/events/power.h:12,
                   from kernel/trace/power-traces.c:15:
  ./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type
     struct trace_entry ent;    \

Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org
Link: http://lkml.kernel.org/r/20191001172817.234886-1-joel@joelfernandes.org
Co-developed-by: Tim Murray 
Signed-off-by: Tim Murray 
Signed-off-by: Joel Fernandes (Google) 
Acked-by: Michal Hocko 
Cc: Carmen Jackson 
Cc: Mayank Gupta 
Cc: Daniel Colascione 
Cc: Steven Rostedt (VMware) 
Cc: Minchan Kim 
Cc: "Aneesh Kumar K.V" 
Cc: Dan Williams 
Cc: Jerome Glisse 
Cc: Matthew Wilcox 
Cc: Ralph Campbell 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: drop mmap_sem before calling balance_dirty_pages() in write fault

2019-12-01T14:29:18+00:00

One of our services is observing hanging ps/top/etc under heavy write
IO, and the task states show this is an mmap_sem priority inversion:

A write fault is holding the mmap_sem in read-mode and waiting for
(heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

Somebody tries to change the address space, contending for the mmap_sem in
write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The waiting writer locks out all subsequent readers to avoid lock
starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

To fix this, do what we do for cache read faults already: drop the
mmap_sem before calling into anything IO bound, in this case the
balance_dirty_pages() function, and return VM_FAULT_RETRY.

Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
Signed-off-by: Johannes Weiner 
Reviewed-by: Matthew Wilcox (Oracle) 
Acked-by: Kirill A. Shutemov 
Cc: Josef Bacik 
Cc: Hillf Danton 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix double page fault on arm64 if PTE_AF is cleared

2019-10-18T10:11:29+00:00

When we tested pmdk unit test [1] vmmalloc_fork TEST3 on arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.

To reproduce the bug, the cmd is as follows after you deployed everything:
make -C src/test/vmmalloc_fork/ TEST_TIME=60m check

Below call trace is from arm64 do_page_fault for debugging purpose:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
               pmd=000000023d4b3003, pte=360000298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameter is
changed because vmf should be passed to cow_user_page()

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
in case there can be some obscure use-case (by Kirill).

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He 
Reported-by: Yibo Cai 
Reviewed-by: Catalin Marinas 
Acked-by: Kirill A. Shutemov 
Signed-off-by: Catalin Marinas

mm: do not hash address in print_bad_pte()

2019-09-24T22:54:09+00:00

Using %px to show the actual address in print_bad_pte()
to help us to debug issue.

Link: http://lkml.kernel.org/r/20190831011816.141002-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds