linux.git/mm/huge_memory.c, branch v5.11

mm: thp: fix MADV_REMOVE deadlock on shmem THP

2021-02-05T19:03:47+00:00

Sergey reported deadlock between kswapd correctly doing its usual
lock_page(page) followed by down_read(page->mapping->i_mmap_rwsem), and
madvise(MADV_REMOVE) on an madvise(MADV_HUGEPAGE) area doing
down_write(page->mapping->i_mmap_rwsem) followed by lock_page(page).

This happened when shmem_fallocate(punch hole)'s unmap_mapping_range()
reaches zap_pmd_range()'s call to __split_huge_pmd().  The same deadlock
could occur when partially truncating a mapped huge tmpfs file, or using
fallocate(FALLOC_FL_PUNCH_HOLE) on it.

__split_huge_pmd()'s page lock was added in 5.8, to make sure that any
concurrent use of reuse_swap_page() (holding page lock) could not catch
the anon THP's mapcounts and swapcounts while they were being split.

Fortunately, reuse_swap_page() is never applied to a shmem or file THP
(not even by khugepaged, which checks PageSwapCache before calling), and
anonymous THPs are never created in shmem or file areas: so that
__split_huge_pmd()'s page lock can only be necessary for anonymous THPs,
on which there is no risk of deadlock with i_mmap_rwsem.
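
The lock ordering argument above can be illustrated with a small userspace
model.  This is only a sketch, not the kernel code: the names lock_graph,
record_path(), has_inversion() and split_path() are hypothetical, and the
"fixed" flag stands in for the PageAnon() check that limits the page lock
to anonymous THPs.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_id { PAGE_LOCK, I_MMAP_RWSEM, NR_LOCKS };

/* order[a][b] is true once some path has taken b while holding a. */
struct lock_graph {
	bool order[NR_LOCKS][NR_LOCKS];
};

/* Record every "held a, then took b" pair seen along one code path. */
static void record_path(struct lock_graph *g, const enum lock_id *path, int n)
{
	assert(n <= NR_LOCKS);
	for (int i = 0; i < n; i++)
		for (int j = i + 1; j < n; j++)
			g->order[path[i]][path[j]] = true;
}

/* An ABBA inversion exists if two locks were ever taken in both orders. */
static bool has_inversion(const struct lock_graph *g)
{
	for (int a = 0; a < NR_LOCKS; a++)
		for (int b = a + 1; b < NR_LOCKS; b++)
			if (g->order[a][b] && g->order[b][a])
				return true;
	return false;
}

/*
 * Locks taken on the way into the modelled pmd split.  File and shmem
 * pages arrive with i_mmap_rwsem already held; with the fix applied,
 * only anonymous pages take the page lock.
 */
static int split_path(bool page_is_anon, bool fixed, enum lock_id *path)
{
	int n = 0;

	if (!page_is_anon)
		path[n++] = I_MMAP_RWSEM;
	if (page_is_anon || !fixed)
		path[n++] = PAGE_LOCK;
	return n;
}
```

Recording kswapd's page lock then i_mmap_rwsem order alongside the unfixed
shmem path reports an inversion; with the fix, the shmem path never takes
the page lock, so no inversion remains.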

Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2101161409470.2022@eggly.anvils
Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()")
Signed-off-by: Hugh Dickins 
Reported-by: Sergey Senozhatsky 
Reviewed-by: Andrea Arcangeli 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix some spelling mistakes in comments

2020-12-16T06:46:19+00:00

Fix some spelling mistakes in comments:
	udpate ==> update
	succesful ==> successful
	exmaple ==> example
	unneccessary ==> unnecessary
	stoping ==> stopping
	uknown ==> unknown

Link: https://lkml.kernel.org/r/20201127011747.86005-1-shihaitao1@huawei.com
Signed-off-by: Haitao Shi 
Reviewed-by: Mike Rapoport 
Reviewed-by: Souptick Joarder 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'akpm' (patches from Andrew)

2020-12-15T22:55:10+00:00

Merge more updates from Andrew Morton:
 "More MM work: a memcg scalability improvement"

* emailed patches from Andrew Morton:
  mm/lru: revise the comments of lru_lock
  mm/lru: introduce relock_page_lruvec()
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/swap.c: serialize memcg changes in pagevec_lru_move_fn
  mm/compaction: do page isolation first in compaction
  mm/lru: introduce TestClearPageLRU()
  mm/mlock: remove __munlock_isolate_lru_page()
  mm/mlock: remove lru_lock on TestClearPageMlocked
  mm/vmscan: remove lruvec reget in move_pages_to_lru
  mm/lru: move lock into lru_note_cost
  mm/swap.c: fold vm event PGROTATED into pagevec_move_tail_fn
  mm/memcg: add debug checking in lock_page_memcg
  mm: page_idle_get_page() does not need lru_lock
  mm/rmap: stop store reordering issue on page->mapping
  mm/vmscan: remove unnecessary lruvec adding
  mm/thp: narrow lru locking
  mm/thp: simplify lru_add_page_tail()
  mm/thp: use head for head page in lru_add_page_tail()
  mm/thp: move lru_add_page_tail() to huge_memory.c

mm/lru: replace pgdat lru_lock with lruvec lock

2020-12-15T22:48:04+00:00

This patch moves the per-node lru_lock into the lruvec, giving each memcg
its own lru_lock on each node.  On a large machine, memcgs therefore no
longer have to contend for the per-node pgdat->lru_lock; each can proceed
under its own lru_lock.

Now that the memcg charge is done before lru insertion, page isolation
can serialize changes to a page's memcg, so the per-memcg lruvec lock is
stable and can replace the per-node lru lock.
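
As a rough illustration of the change in lock granularity, here is a
userspace model.  It is a sketch under stated assumptions, not the kernel
data structures: lock_model, lruvec_model, pgdat_model, page_model and the
two lookup helpers are hypothetical names, and the array sizes are
arbitrary.

```c
#include <assert.h>

#define MAX_NODES	2	/* arbitrary sizes for the model */
#define MAX_MEMCGS	4

struct lock_model {
	int held;			/* placeholder for a spinlock */
};

struct lruvec_model {
	struct lock_model lru_lock;	/* after: one lock per (memcg, node) */
};

struct pgdat_model {
	struct lock_model lru_lock;	/* before: one lock per node */
	struct lruvec_model lruvecs[MAX_MEMCGS];
};

struct page_model {
	int nid;			/* node the page lives on */
	int memcg_id;			/* memcg the page is charged to */
};

static struct pgdat_model nodes[MAX_NODES];

/* Before: every page on a node shares that node's lru_lock. */
static struct lock_model *old_lru_lock(const struct page_model *page)
{
	assert(page->nid >= 0 && page->nid < MAX_NODES);
	return &nodes[page->nid].lru_lock;
}

/* After: the lock is chosen by the page's memcg as well as its node. */
static struct lock_model *new_lru_lock(const struct page_model *page)
{
	assert(page->nid >= 0 && page->nid < MAX_NODES);
	assert(page->memcg_id >= 0 && page->memcg_id < MAX_MEMCGS);
	return &nodes[page->nid].lruvecs[page->memcg_id].lru_lock;
}
```

Two pages on the same node but charged to different memcgs map to the same
lock under the old scheme and to different locks under the new one, which
is where the reduced contention comes from.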

In isolate_migratepages_block(), compact_unlock_should_abort() and
lock_page_lruvec_irqsave() are open coded to work with compact_control.
Also add a debug function to the locking paths, which may give some clues
if something goes wrong.

Daniel Jordan's testing shows a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT Broadwell box.
https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Hugh Dickins helped polish the patch, thanks!

[alex.shi@linux.alibaba.com: fix comment typo]
  Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com
[alex.shi@linux.alibaba.com: use page_memcg()]
  Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com

Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi 
Acked-by: Hugh Dickins 
Acked-by: Johannes Weiner 
Cc: Rong Chen 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Yang Shi 
Cc: Matthew Wilcox 
Cc: Konstantin Khlebnikov 
Cc: Daniel Jordan 
Cc: Alexander Duyck 
Cc: Andrea Arcangeli 
Cc: Andrey Ryabinin 
Cc: "Huang, Ying" 
Cc: Jann Horn 
Cc: Joonsoo Kim 
Cc: Kirill A. Shutemov 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Mika Penttilä 
Cc: Minchan Kim 
Cc: Shakeel Butt 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/thp: narrow lru locking

2020-12-15T22:48:03+00:00

lru_lock and page cache xa_lock have no obvious reason to be taken one
way round or the other: until now, lru_lock has been taken before page
cache xa_lock, when splitting a THP; but nothing else takes them
together.  Reverse that ordering: let's narrow the lru locking - but
leave local_irq_disable to block interrupts throughout, like before.

Hugh Dickins' point: split_huge_page_to_list() was already silly, to be
using the _irqsave variant: it's just been taking sleeping locks, so
would already be broken if entered with interrupts disabled.  So we can
save passing the flags argument down to __split_huge_page().

Why change the lock ordering here? That was hard to decide.  One reason:
when this series reaches per-memcg lru locking, it relies on the THP's
memcg to be stable when taking the lru_lock: that is now done after the
THP's refcount has been frozen, which ensures page memcg cannot change.

Another reason: previously, lock_page_memcg()'s move_lock was presumed
to nest inside lru_lock; but now lru_lock must nest inside (page cache
lock inside) move_lock, so it becomes possible to use lock_page_memcg()
to stabilize page memcg before taking its lru_lock.  That is not the
mechanism used in this series, but it is an option we want to keep open.

[hughd@google.com: rewrite commit log]

Link: https://lkml.kernel.org/r/1604566549-62481-5-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi 
Reviewed-by: Kirill A. Shutemov 
Acked-by: Hugh Dickins 
Cc: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
Cc: Johannes Weiner 
Cc: Matthew Wilcox 
Cc: Alexander Duyck 
Cc: Andrey Ryabinin 
Cc: "Chen, Rong A" 
Cc: Daniel Jordan 
Cc: "Huang, Ying" 
Cc: Jann Horn 
Cc: Joonsoo Kim 
Cc: Konstantin Khlebnikov 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Mika Penttilä 
Cc: Minchan Kim 
Cc: Shakeel Butt 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/thp: simplify lru_add_page_tail()

2020-12-15T22:48:03+00:00

Simplify lru_add_page_tail(): there are actually only two cases
possible: split_huge_page_to_list(), with list supplied and head
isolated from lru by its caller; or split_huge_page(), with NULL list
and head on lru - because when head is racily isolated from lru, the
isolator's reference will stop the split from getting any further than
its page_ref_freeze().

So decide between the two cases by "list", but add VM_WARN_ON()s to
verify that they match our lru expectations.
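
The two cases can be sketched in userspace as follows.  This is a model
under stated assumptions rather than the kernel function: list_node,
page_model, list_add_before() and add_page_tail() are hypothetical names,
the on_lru flag stands in for PageLRU, and the boolean return value stands
in for the VM_WARN_ON() checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

/* Insert n just before pos, i.e. at the tail when pos is a list head. */
static void list_add_before(struct list_node *n, struct list_node *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

struct page_model {
	struct list_node lru;
	bool on_lru;		/* stands in for PageLRU */
	int refcount;
};

/*
 * Choose the case by "list" alone.  Returns false if the head page's lru
 * state does not match what that case expects.
 */
static bool add_page_tail(struct page_model *head, struct page_model *tail,
			  struct list_node *list)
{
	bool expected;

	assert(head && tail);
	if (list) {
		/* Caller isolated head from the lru and supplied a list. */
		expected = !head->on_lru;
		tail->refcount++;
		list_add_before(&tail->lru, list);
	} else {
		/* Head is still on the lru: queue the tail next to it. */
		expected = head->on_lru;
		tail->on_lru = true;
		list_add_before(&tail->lru, &head->lru);
	}
	return expected;
}
```

With a NULL list the tail page ends up on the lru next to its head; with a
list supplied it goes onto that list with an extra reference and stays off
the lru.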

[Hugh Dickins: rewrite commit log]

Link: https://lkml.kernel.org/r/1604566549-62481-4-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi 
Reviewed-by: Kirill A. Shutemov 
Acked-by: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Matthew Wilcox 
Cc: Mika Penttilä 
Cc: Alexander Duyck 
Cc: Andrea Arcangeli 
Cc: Andrey Ryabinin 
Cc: "Chen, Rong A" 
Cc: Daniel Jordan 
Cc: "Huang, Ying" 
Cc: Jann Horn 
Cc: Joonsoo Kim 
Cc: Kirill A. Shutemov 
Cc: Konstantin Khlebnikov 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Shakeel Butt 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/thp: use head for head page in lru_add_page_tail()

2020-12-15T22:48:03+00:00

The first parameter is only ever used for the head page, so name it
accordingly to make that explicit.

Link: https://lkml.kernel.org/r/1604566549-62481-3-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi 
Reviewed-by: Kirill A. Shutemov 
Reviewed-by: Matthew Wilcox (Oracle) 
Acked-by: Hugh Dickins 
Acked-by: Johannes Weiner 
Cc: Alexander Duyck 
Cc: Andrea Arcangeli 
Cc: Andrey Ryabinin 
Cc: "Chen, Rong A" 
Cc: Daniel Jordan 
Cc: "Huang, Ying" 
Cc: Jann Horn 
Cc: Joonsoo Kim 
Cc: Kirill A. Shutemov 
Cc: Konstantin Khlebnikov 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Mika Penttilä 
Cc: Minchan Kim 
Cc: Shakeel Butt 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: Wei Yang 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/thp: move lru_add_page_tail() to huge_memory.c

2020-12-15T22:48:03+00:00

Patch series "per memcg lru lock", v21.

This patchset includes 3 parts:

 1) some code cleanup and minor optimization as preparation

 2) use TestClearPageLRU as page isolation's precondition

 3) replace per node lru_lock with per memcg per node lru_lock

Currently there is one lru_lock per node, pgdat->lru_lock, which guards
the lru lists, even though the lru lists were moved into memcg a long
time ago.  Continuing to use a per-node lru_lock is clearly unscalable:
pages in different memcgs have to compete with each other for a single
lru_lock.  This patchset uses a per-lruvec/memcg lru_lock in place of the
per-node lru lock to guard the lru lists, making them scalable across
memcgs and giving a performance gain.

Currently lru_lock guards both the lru list and the page's lru bit, which
is fine.  But if we want to take a specific lruvec lock for a page, we
need to pin down the page's lruvec/memcg while locking.  Just taking the
lruvec lock first could be undermined by the page's memcg
charge/migration.  To fix this, we take the clearing of the page's lru
bit out from under the lock and use it as the pin-down action that blocks
memcg changes.  That is the reason for the new atomic function
TestClearPageLRU.  So isolating a page now needs both actions:
TestClearPageLRU and holding the lru_lock.
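
A minimal userspace sketch of the "clear the lru bit first" idea, assuming
C11 atomics as a stand-in for the kernel's atomic bit operations.  The
names page_model, PG_LRU_BIT, test_clear_page_lru() and isolate_page() are
hypothetical, and the lruvec locking step is only indicated by a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PG_LRU_BIT	(1u << 0)	/* arbitrary bit position for the model */

struct page_model {
	atomic_uint flags;
};

/* Atomically clear the lru bit and report whether it was set before. */
static bool test_clear_page_lru(struct page_model *page)
{
	unsigned int old;

	assert(page);
	old = atomic_fetch_and(&page->flags, ~PG_LRU_BIT);
	return (old & PG_LRU_BIT) != 0;
}

/*
 * Only the caller that actually cleared the bit may go on to isolate the
 * page; everyone else backs off without touching the lru lists.
 */
static bool isolate_page(struct page_model *page)
{
	if (!test_clear_page_lru(page))
		return false;

	/*
	 * In the modelled design, the winner would now look up the page's
	 * lruvec, take its lru_lock, and unlink the page from the lru list.
	 */
	return true;
}
```

Because the clear is atomic, two racing isolators cannot both succeed: the
second one sees the bit already cleared and returns early.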

The typical user of this is isolate_migratepages_block() in compaction.c:
we have to take the lru bit before the lru lock, which serializes page
isolation against memcg page charge/migration, since the latter changes
the page's lruvec and hence the lru_lock it belongs to.

The above solution was suggested by Johannes Weiner, and this patchset is
based on his new memcg charge path.  (Hugh Dickins tested and contributed
much code, from the compaction fix to general code polish, thanks a
lot!)

Daniel Jordan's testing shows a 62% improvement on a modified readtwice
case on his 2P * 10 core * 2 HT Broadwell box on v18, which is not much
different from this v20.

 https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed this
idea 8 years ago, and to others who gave comments as well: Daniel Jordan,
Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander Duyck etc.

Thanks for testing support from Intel 0day and Rong Chen, Fengguang Wu,
and Yun Wang.  Hugh Dickins also shared his kbuild-swap case.

This patch (of 19):

lru_add_page_tail() is only used in huge_memory.c; defining it in another
file under a CONFIG_TRANSPARENT_HUGEPAGE macro restriction just looks
weird.

Let's move it to the THP code, and make it static as Hugh Dickins
suggested.

Link: https://lkml.kernel.org/r/1604566549-62481-1-git-send-email-alex.shi@linux.alibaba.com
Link: https://lkml.kernel.org/r/1604566549-62481-2-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi 
Reviewed-by: Kirill A. Shutemov 
Acked-by: Hugh Dickins 
Acked-by: Johannes Weiner 
Cc: Matthew Wilcox 
Cc: Mel Gorman 
Cc: Tejun Heo 
Cc: Konstantin Khlebnikov 
Cc: Daniel Jordan 
Cc: Shakeel Butt 
Cc: Joonsoo Kim 
Cc: Wei Yang 
Cc: Alexander Duyck 
Cc: "Chen, Rong A" 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Andrea Arcangeli 
Cc: Andrey Ryabinin 
Cc: "Huang, Ying" 
Cc: Jann Horn 
Cc: Kirill A. Shutemov 
Cc: Michal Hocko 
Cc: Mika Penttilä 
Cc: Minchan Kim 
Cc: Thomas Gleixner 
Cc: Vlastimil Babka 
Cc: Yang Shi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

2020-12-15T21:22:29+00:00

Pull networking updates from Jakub Kicinski:
 "Core:

   - support "prefer busy polling" NAPI operation mode, where we defer
     softirq for some time expecting applications to periodically busy
     poll

   - AF_XDP: improve efficiency by more batching and hindering the
     adjacency cache prefetcher

   - af_packet: make packet_fanout.arr size configurable up to 64K

   - tcp: optimize TCP zero copy receive in presence of partial or
     unaligned reads making zero copy a performance win for much smaller
     messages

   - XDP: add bulk APIs for returning / freeing frames

   - sched: support fragmenting IP packets as they come out of conntrack

   - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

  BPF:

   - BPF switch from crude rlimit-based to memcg-based memory accounting

   - BPF type format information for kernel modules and related tracing
     enhancements

   - BPF implement task local storage for BPF LSM

   - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
     bpf_sk_storage

  Protocols:

   - mptcp: improve multiple xmit streams support, memory accounting and
     many smaller improvements

   - TLS: support CHACHA20-POLY1305 cipher

   - seg6: add support for SRv6 End.DT4/DT6 behavior

   - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

   - ppp_generic: add ability to bridge channels directly

   - bridge: Connectivity Fault Management (CFM) support as is defined
     in IEEE 802.1Q section 12.14.

  Drivers:

   - mlx5: make use of the new auxiliary bus to organize the driver
     internals

   - mlx5: more accurate port TX timestamping support

   - mlxsw:
      - improve the efficiency of offloaded next hop updates by using
        the new nexthop object API
      - support blackhole nexthops
      - support IEEE 802.1ad (Q-in-Q) bridging

   - rtw88: major Bluetooth coexistence improvements

   - iwlwifi: support new 6 GHz frequency band

   - ath11k: Fast Initial Link Setup (FILS)

   - mt7915: dual band concurrent (DBDC) support

   - net: ipa: add basic support for IPA v4.5

  Refactor:

   - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
     Siewior

   - phy: add support for shared interrupts; get rid of multiple driver
     APIs and have the drivers write a full IRQ handler, slight growth
     of driver code should be compensated by the simpler API which also
     allows shared IRQs

   - add common code for handling netdev per-cpu counters

   - move TX packet re-allocation from Ethernet switch tag drivers to a
     central place

   - improve efficiency and rename nla_strlcpy

   - a number of W=1 warning cleanups as we now catch those in a
     patchwork build bot

  Old code removal:

   - wan: delete the DLCI / SDLA drivers

   - wimax: move to staging

   - wifi: remove old WDS wifi bridging support"

* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
  net: hns3: fix expression that is currently always true
  net: fix proc_fs init handling in af_packet and tls
  nfc: pn533: convert comma to semicolon
  af_vsock: Assign the vsock transport considering the vsock address flags
  af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
  vsock_addr: Check for supported flag values
  vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
  vm_sockets: Add flags field in the vsock address data structure
  net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
  tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
  net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
  nfc: s3fwrn5: Release the nfc firmware
  net: vxget: clean up sparse warnings
  mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
  mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
  mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
  mlxsw: reg: Add Router LPM Cache Enable Register
  mlxsw: reg: Add Router LPM Cache ML Delete Register
  mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
  mlxsw: reg: Add XM Router M Table Register
  ...

mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening

2020-12-15T20:13:47+00:00

Convert the only use of sprintf with struct kobject * that the cocci
script could not convert.

Miscellanea:

 - Neaten the uses of a constant string with sysfs_emit to use a const
   char * to reduce overall object size
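
The pattern described above can be sketched in userspace as follows.  This
is an illustration only: emit_model() is a hypothetical stand-in for
sysfs_emit() built on vsnprintf(), MODEL_PAGE_SIZE is an assumed buffer
size, and enabled_show_model() with its thp_mode enum is an invented
example of a show routine, not the actual huge_memory.c code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE	4096	/* assumed size of the sysfs output buffer */

/*
 * Format into a page-sized buffer and return the number of characters
 * actually stored, not counting the terminating NUL.
 */
static int emit_model(char *buf, const char *fmt, ...)
{
	va_list args;
	int len;

	assert(buf && fmt);
	va_start(args, fmt);
	len = vsnprintf(buf, MODEL_PAGE_SIZE, fmt, args);
	va_end(args);

	if (len < 0)
		return 0;
	if (len >= MODEL_PAGE_SIZE)
		len = MODEL_PAGE_SIZE - 1;	/* output was truncated */
	return len;
}

enum thp_mode { THP_ALWAYS, THP_MADVISE, THP_NEVER };

/*
 * Pick a constant string first, then emit it through a single call site
 * with one shared "%s\n" format.
 */
static int enabled_show_model(enum thp_mode mode, char *buf)
{
	const char *output;

	if (mode == THP_ALWAYS)
		output = "[always] madvise never";
	else if (mode == THP_MADVISE)
		output = "always [madvise] never";
	else
		output = "always madvise [never]";

	return emit_model(buf, "%s\n", output);
}
```

Selecting the string through a const char * leaves one formatting call in
the function instead of one per branch, which is the kind of object size
saving the neatening aims for.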

Link: https://lkml.kernel.org/r/7df6be66bbd68e1a0bca9d35aca1341dbf94d2a7.1605376435.git.joe@perches.com
Signed-off-by: Joe Perches 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Greg Kroah-Hartman 
Cc: Hugh Dickins 
Cc: Joonsoo Kim 
Cc: Matthew Wilcox 
Cc: Mike Kravetz 
Cc: Pekka Enberg 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds