linux.git/mm/memcontrol.c, branch v4.15

mm, memcg: fix mem_cgroup_swapout() for THPs

2017-11-30T02:40:43+00:00

Commit d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout()
support THP") changed mem_cgroup_swapout() to support transparent huge
page (THP).

However the patch missed one location which should be changed for
correctly handling THPs.  The resulting bug will cause the memory
cgroups whose THPs were swapped out to become zombies on deletion.

Link: http://lkml.kernel.org/r/20171128161941.20931-1-shakeelb@google.com
Fixes: d6810d730022 ("memcg, THP, swap: make mem_cgroup_swapout() support THP")
Signed-off-by: Shakeel Butt 
Acked-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Huang Ying 
Cc: Vladimir Davydov 
Cc: Greg Thelen 
Cc: 

Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: slabinfo: remove CONFIG_SLABINFO

2017-11-16T02:21:01+00:00

According to discussion with Christoph
(https://marc.info/?l=linux-kernel&m=150695909709711&w=2), it sounds like
it is pointless to keep CONFIG_SLABINFO around.

This patch removes the CONFIG_SLABINFO config option, but /proc/slabinfo
is still available.

[yang.s@alibaba-inc.com: v11]
  Link: http://lkml.kernel.org/r/1507656303-103845-3-git-send-email-yang.s@alibaba-inc.com
Link: http://lkml.kernel.org/r/1507152550-46205-3-git-send-email-yang.s@alibaba-inc.com
Signed-off-by: Yang Shi 
Acked-by: David Rientjes 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Michal Hocko 
Cc: Pekka Enberg 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

net: memcontrol: defer call to mem_cgroup_sk_alloc()

2017-10-10T03:55:01+00:00

Instead of calling mem_cgroup_sk_alloc() from BH context,
it is better to call it from inet_csk_accept() in process context.

Not only this removes code in mem_cgroup_sk_alloc(), but it also
fixes a bug since listener might have been dismantled and css_get()
might cause a use-after-free.

Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Eric Dumazet 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Signed-off-by: David S. Miller

mm/memcg: avoid page count check for zone device

2017-10-04T00:54:24+00:00

Fix for 4.14, zone device page always have an elevated refcount of one
and thus page count sanity check in uncharge_page() is inappropriate for
them.

[mhocko@suse.com: nano-optimize VM_BUG_ON in uncharge_page]
Link: http://lkml.kernel.org/r/20170914190011.5217-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse 
Signed-off-by: Michal Hocko 
Reported-by: Evgeny Baskakov 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, memcg: remove hotplug locking from try_charge

2017-10-04T00:54:24+00:00

The following lockdep splat has been noticed during LTP testing

  ======================================================
  WARNING: possible circular locking dependency detected
  4.13.0-rc3-next-20170807 #12 Not tainted
  ------------------------------------------------------
  a.out/4771 is trying to acquire lock:
   (cpu_hotplug_lock.rw_sem){++++++}, at: [] drain_all_stock.part.35+0x18/0x140

  but task is already holding lock:
   (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x175/0x530

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&mm->mmap_sem){++++++}:
         lock_acquire+0xc9/0x230
         __might_fault+0x70/0xa0
         _copy_to_user+0x23/0x70
         filldir+0xa7/0x110
         xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
         xfs_readdir+0x1fa/0x2c0 [xfs]
         xfs_file_readdir+0x30/0x40 [xfs]
         iterate_dir+0x17a/0x1a0
         SyS_getdents+0xb0/0x160
         entry_SYSCALL_64_fastpath+0x1f/0xbe

  -> #2 (&type->i_mutex_dir_key#3){++++++}:
         lock_acquire+0xc9/0x230
         down_read+0x51/0xb0
         lookup_slow+0xde/0x210
         walk_component+0x160/0x250
         link_path_walk+0x1a6/0x610
         path_openat+0xe4/0xd50
         do_filp_open+0x91/0x100
         file_open_name+0xf5/0x130
         filp_open+0x33/0x50
         kernel_read_file_from_path+0x39/0x80
         _request_firmware+0x39f/0x880
         request_firmware_direct+0x37/0x50
         request_microcode_fw+0x64/0xe0
         reload_store+0xf7/0x180
         dev_attr_store+0x18/0x30
         sysfs_kf_write+0x44/0x60
         kernfs_fop_write+0x113/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xc7/0x1c0
         SyS_write+0x58/0xc0
         do_syscall_64+0x6c/0x1f0
         return_from_SYSCALL_64+0x0/0x7a

  -> #1 (microcode_mutex){+.+.+.}:
         lock_acquire+0xc9/0x230
         __mutex_lock+0x88/0x960
         mutex_lock_nested+0x1b/0x20
         microcode_init+0xbb/0x208
         do_one_initcall+0x51/0x1a9
         kernel_init_freeable+0x208/0x2a7
         kernel_init+0xe/0x104
         ret_from_fork+0x2a/0x40

  -> #0 (cpu_hotplug_lock.rw_sem){++++++}:
         __lock_acquire+0x153c/0x1550
         lock_acquire+0xc9/0x230
         cpus_read_lock+0x4b/0x90
         drain_all_stock.part.35+0x18/0x140
         try_charge+0x3ab/0x6e0
         mem_cgroup_try_charge+0x7f/0x2c0
         shmem_getpage_gfp+0x25f/0x1050
         shmem_fault+0x96/0x200
         __do_fault+0x1e/0xa0
         __handle_mm_fault+0x9c3/0xe00
         handle_mm_fault+0x16e/0x380
         __do_page_fault+0x24a/0x530
         do_page_fault+0x30/0x80
         page_fault+0x28/0x30

  other info that might help us debug this:

  Chain exists of:
    cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&mm->mmap_sem);
                                 lock(&type->i_mutex_dir_key#3);
                                 lock(&mm->mmap_sem);
    lock(cpu_hotplug_lock.rw_sem);

   *** DEADLOCK ***

  2 locks held by a.out/4771:
   #0:  (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x175/0x530
   #1:  (percpu_charge_mutex){+.+...}, at: [] try_charge+0x397/0x6e0

The problem is very similar to the one fixed by commit a459eeb7b852
("mm, page_alloc: do not depend on cpu hotplug locks inside the
allocator").  We are taking hotplug locks while we can be sitting on top
of basically arbitrary locks.  This just calls for problems.

We can get rid of {get,put}_online_cpus, fortunately.  We do not have to
be worried about races with memory hotplug because drain_local_stock,
which is called from both the WQ draining and the memory hotplug
contexts, is always operating on the local cpu stock with IRQs disabled.

The only thing to be careful about is that the target memcg doesn't
vanish while we are still in drain_all_stock so take a reference on it.

Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Reported-by: Artem Savkov 
Tested-by: Artem Savkov 
Cc: Johannes Weiner 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mem/memcg: cache rightmost node

2017-09-09T01:26:49+00:00

Such that we can optimize __mem_cgroup_largest_soft_limit_node().  The
only overhead is the extra footprint for the cached pointer, but this
should not be an issue for mem_cgroup_tree_per_node.

[dave@stgolabs.net: brain fart #2]
  Link: http://lkml.kernel.org/r/20170731160114.GE21328@linux-80c1.suse
Link: http://lkml.kernel.org/r/20170719014603.19029-17-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcontrol: use per-cpu stocks for socket memory uncharging

2017-09-09T01:26:47+00:00

We've noticed a quite noticeable performance overhead on some hosts with
significant network traffic when socket memory accounting is enabled.

Perf top shows that socket memory uncharging path is hot:
  2.13%  [kernel]                [k] page_counter_cancel
  1.14%  [kernel]                [k] __sk_mem_reduce_allocated
  1.14%  [kernel]                [k] _raw_spin_lock
  0.87%  [kernel]                [k] _raw_spin_lock_irqsave
  0.84%  [kernel]                [k] tcp_ack
  0.84%  [kernel]                [k] ixgbe_poll
  0.83%  < workload >
  0.82%  [kernel]                [k] enqueue_entity
  0.68%  [kernel]                [k] __fget
  0.68%  [kernel]                [k] tcp_delack_timer_handler
  0.67%  [kernel]                [k] __schedule
  0.60%  < workload >
  0.59%  [kernel]                [k] __inet6_lookup_established
  0.55%  [kernel]                [k] __switch_to
  0.55%  [kernel]                [k] menu_select
  0.54%  libc-2.20.so            [.] __memcpy_avx_unaligned

To address this issue, the existing per-cpu stock infrastructure can be
used.

refill_stock() can be called from mem_cgroup_uncharge_skmem() to move
charge to a per-cpu stock instead of calling atomic
page_counter_uncharge().

To prevent the uncontrolled growth of per-cpu stocks, refill_stock()
will explicitly drain the cached charge, if the cached value exceeds
CHARGE_BATCH.

This allows significantly optimize the load:
  1.21%  [kernel]                [k] _raw_spin_lock
  1.01%  [kernel]                [k] ixgbe_poll
  0.92%  [kernel]                [k] _raw_spin_lock_irqsave
  0.90%  [kernel]                [k] enqueue_entity
  0.86%  [kernel]                [k] tcp_ack
  0.85%  < workload >
  0.74%  perf-11120.map          [.] 0x000000000061bf24
  0.73%  [kernel]                [k] __schedule
  0.67%  [kernel]                [k] __fget
  0.63%  [kernel]                [k] __inet6_lookup_established
  0.62%  [kernel]                [k] menu_select
  0.59%  < workload >
  0.59%  [kernel]                [k] __switch_to
  0.57%  libc-2.20.so            [.] __memcpy_avx_unaligned

Link: http://lkml.kernel.org/r/20170829100150.4580-1-guro@fb.com
Signed-off-by: Roman Gushchin 
Acked-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/device-public-memory: device memory cache coherent with CPU

2017-09-09T01:26:46+00:00

Platform with advance system bus (like CAPI or CCIX) allow device memory
to be accessible from CPU in a cache coherent fashion.  Add a new type of
ZONE_DEVICE to represent such memory.  The use case are the same as for
the un-addressable device memory but without all the corners cases.

Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
Signed-off-by: Jérôme Glisse 
Cc: Aneesh Kumar 
Cc: Paul E. McKenney 
Cc: Benjamin Herrenschmidt 
Cc: Dan Williams 
Cc: Ross Zwisler 
Cc: Balbir Singh 
Cc: David Nellans 
Cc: Evgeny Baskakov 
Cc: Johannes Weiner 
Cc: John Hubbard 
Cc: Kirill A. Shutemov 
Cc: Mark Hairgrove 
Cc: Michal Hocko 
Cc: Sherry Cheung 
Cc: Subhash Gutti 
Cc: Vladimir Davydov 
Cc: Bob Liu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memcontrol: support MEMORY_DEVICE_PRIVATE

2017-09-09T01:26:46+00:00

HMM pages (private or public device pages) are ZONE_DEVICE page and thus
need special handling when it comes to lru or refcount.  This patch make
sure that memcontrol properly handle those when it face them.  Those pages
are use like regular pages in a process address space either as anonymous
page or as file back page.  So from memcg point of view we want to handle
them like regular page for now at least.

Link: http://lkml.kernel.org/r/20170817000548.32038-11-jglisse@redhat.com
Signed-off-by: Jérôme Glisse 
Acked-by: Balbir Singh 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Aneesh Kumar 
Cc: Benjamin Herrenschmidt 
Cc: Dan Williams 
Cc: David Nellans 
Cc: Evgeny Baskakov 
Cc: John Hubbard 
Cc: Kirill A. Shutemov 
Cc: Mark Hairgrove 
Cc: Paul E. McKenney 
Cc: Ross Zwisler 
Cc: Sherry Cheung 
Cc: Subhash Gutti 
Cc: Bob Liu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memcontrol: allow to uncharge page without using page->lru field

2017-09-09T01:26:46+00:00

HMM pages (private or public device pages) are ZONE_DEVICE page and
thus you can not use page->lru fields of those pages. This patch
re-arrange the uncharge to allow single page to be uncharge without
modifying the lru field of the struct page.

There is no change to memcontrol logic, it is the same as it was
before this patch.

Link: http://lkml.kernel.org/r/20170817000548.32038-10-jglisse@redhat.com
Signed-off-by: Jérôme Glisse 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Aneesh Kumar 
Cc: Balbir Singh 
Cc: Benjamin Herrenschmidt 
Cc: Dan Williams 
Cc: David Nellans 
Cc: Evgeny Baskakov 
Cc: John Hubbard 
Cc: Kirill A. Shutemov 
Cc: Mark Hairgrove 
Cc: Paul E. McKenney 
Cc: Ross Zwisler 
Cc: Sherry Cheung 
Cc: Subhash Gutti 
Cc: Bob Liu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds