linux.git/mm/memcontrol.c, branch v5.14

mm/memcg: fix incorrect flushing of lruvec data in obj_stock

2021-08-14T00:09:32+00:00

When mod_objcg_state() is called with a pgdat that is different from
that in the obj_stock, the old lruvec data cached in obj_stock are
flushed out.  Unfortunately, they were flushed to the new pgdat and so
the data go to the wrong node.  This will screw up the slab data
reported in /sys/devices/system/node/node*/meminfo.

Fix that by flushing the data to the cached pgdat instead.
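
The difference between the buggy and the fixed flush can be sketched with
a small userspace model.  This is only an illustration under assumptions:
struct stock, node_stat[] and mod_state() are hypothetical stand-ins for
obj_stock, the per-node slab counters and mod_objcg_state(), and the model
ignores the objcg, the stat index and any flush thresholds.

```c
#define NR_NODES 2

/* Hypothetical per-node counters standing in for per-pgdat lruvec stats. */
static long node_stat[NR_NODES];

/* Hypothetical stand-in for the per-cpu obj_stock cache. */
struct stock {
	int cached_node;	/* node the pending delta belongs to, -1 if none */
	long cached_delta;	/* bytes not yet flushed to node_stat[] */
};

/* Flush the pending delta to the node it was accumulated for. */
static void stock_flush(struct stock *s)
{
	if (s->cached_node >= 0 && s->cached_delta)
		node_stat[s->cached_node] += s->cached_delta;
	s->cached_delta = 0;
}

/* Account delta against node; switch the cache when the node changes. */
static void mod_state(struct stock *s, int node, long delta)
{
	if (s->cached_node != node) {
		/* Flush to the *cached* node before adopting the new one. */
		stock_flush(s);
		s->cached_node = node;
	}
	s->cached_delta += delta;
}
```

In this model, accounting 100 bytes on node 0 and then 40 bytes on node 1
leaves the 100 bytes on node 0.  Flushing them to the node passed by the
second call instead would reproduce the misattribution described above.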

Link: https://lkml.kernel.org/r/20210802143834.30578-1-longman@redhat.com
Fixes: 68ac5b3c8db2 ("mm/memcg: cache vmstat data in percpu memcg_stock_pcp")
Signed-off-by: Waiman Long 
Acked-by: Michal Hocko 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Cc: Johannes Weiner 
Cc: Vladimir Davydov 
Cc: Tejun Heo 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Vlastimil Babka 
Cc: Muchun Song 
Cc: Alex Shi 
Cc: Chris Down 
Cc: Yafang Shao 
Cc: Wei Yang 
Cc: Masayoshi Mizuma 
Cc: Xing Zhengjun 
Cc: Matthew Wilcox 
Cc: Waiman Long 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcontrol: fix blocking rstat function called from atomic cgroup1 thresholding code

2021-07-30T17:14:39+00:00

Dan Carpenter reports:

    The patch 2d146aa3aa84: "mm: memcontrol: switch to rstat" from Apr
    29, 2021, leads to the following static checker warning:

	    kernel/cgroup/rstat.c:200 cgroup_rstat_flush()
	    warn: sleeping in atomic context

    mm/memcontrol.c
      3572  static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
      3573  {
      3574          unsigned long val;
      3575
      3576          if (mem_cgroup_is_root(memcg)) {
      3577                  cgroup_rstat_flush(memcg->css.cgroup);
			    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    This is from static analysis and potentially a false positive.  The
    problem is that mem_cgroup_usage() is called from __mem_cgroup_threshold()
    which holds an rcu_read_lock().  And the cgroup_rstat_flush() function
    can sleep.

      3578                  val = memcg_page_state(memcg, NR_FILE_PAGES) +
      3579                          memcg_page_state(memcg, NR_ANON_MAPPED);
      3580                  if (swap)
      3581                          val += memcg_page_state(memcg, MEMCG_SWAP);
      3582          } else {
      3583                  if (!swap)
      3584                          val = page_counter_read(&memcg->memory);
      3585                  else
      3586                          val = page_counter_read(&memcg->memsw);
      3587          }
      3588          return val;
      3589  }

__mem_cgroup_threshold() indeed holds the rcu lock.  In addition, the
thresholding code is invoked during stat changes, and those contexts
have irqs disabled as well.  If the lock breaking occurs inside the
flush function, it will result in a sleep from an atomic context.

Use the irqsafe flushing variant in mem_cgroup_usage() to fix this.
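
The failure mode can be modelled in userspace as follows.  This is a
sketch, not kernel code: atomic_depth, might_sleep_model() and the two
flush functions are hypothetical names that only mimic a flush that may
yield versus one that never does.

```c
#include <stdbool.h>

/* Hypothetical model of execution context; not a kernel API. */
static int atomic_depth;	/* > 0 while the modelled rcu/irq-off section is held */
static int sleep_bugs;		/* sleeps attempted while atomic_depth > 0 */

/* Record a bug if a sleep is attempted from the modelled atomic context. */
static void might_sleep_model(void)
{
	if (atomic_depth > 0)
		sleep_bugs++;
}

/* Models a flush that may break its lock and reschedule mid-walk. */
static void flush_may_sleep(void)
{
	might_sleep_model();
}

/* Models the irqsafe variant: it never yields, so nothing is recorded. */
static void flush_irqsafe(void)
{
}

/* Models the threshold path, which reads usage from atomic context. */
static int threshold_check(bool use_irqsafe)
{
	atomic_depth++;
	if (use_irqsafe)
		flush_irqsafe();
	else
		flush_may_sleep();
	atomic_depth--;
	return sleep_bugs;
}
```

Only the sleeping variant trips the check when it is called from the
threshold path, which is the situation the static checker warned about.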

Link: https://lkml.kernel.org/r/20210726150019.251820-1-hannes@cmpxchg.org
Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat")
Signed-off-by: Johannes Weiner 
Reported-by: Dan Carpenter 
Acked-by: Chris Down 
Reviewed-by: Rik van Riel 
Acked-by: Michal Hocko 
Reviewed-by: Shakeel Butt 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'akpm' (patches from Andrew)

2021-07-02T19:08:10+00:00

Merge more updates from Andrew Morton:
 "190 patches.

  Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
  vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
  migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
  zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
  core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
  signals, exec, kcov, selftests, compress/decompress, and ipc"

* emailed patches from Andrew Morton: (190 commits)
  ipc/util.c: use binary search for max_idx
  ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  ipc: use kmalloc for msg_queue and shmid_kernel
  ipc sem: use kvmalloc for sem_undo allocation
  lib/decompressors: remove set but not used variabled 'level'
  selftests/vm/pkeys: exercise x86 XSAVE init state
  selftests/vm/pkeys: refill shadow register after implicit kernel write
  selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  exec: remove checks in __register_bimfmt()
  x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  hfsplus: report create_date to kstat.btime
  hfsplus: remove unnecessary oom message
  nilfs2: remove redundant continue statement in a while-loop
  kprobes: remove duplicated strong free_insn_page in x86 and s390
  init: print out unknown kernel parameters
  checkpatch: do not complain about positive return values starting with EPOLL
  checkpatch: improve the indented label test
  checkpatch: scripts/spdxcheck.py now requires python3
  ...

Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu

2021-07-02T00:17:24+00:00

Pull percpu updates from Dennis Zhou:

 - percpu chunk depopulation - depopulate backing pages for chunks with
   empty pages when we exceed a global threshold without those pages.
   This lets us reclaim a portion of memory that would previously be
   lost until the full chunk would be freed (possibly never).

 - memcg accounting cleanup - previously separate chunks were managed
   for normal allocations and __GFP_ACCOUNT allocations. These are now
   consolidated which cleans up the code quite a bit.

 - a few misc clean ups for clang warnings

* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
  percpu: optimize locking in pcpu_balance_workfn()
  percpu: initialize best_upa variable
  percpu: rework memcg accounting
  mm, memcg: introduce mem_cgroup_kmem_disabled()
  mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init
  percpu: make symbol 'pcpu_free_slot' static
  percpu: implement partial chunk depopulation
  percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
  percpu: factor out pcpu_check_block_hint()
  percpu: split __pcpu_balance_workfn()
  percpu: fix a comment about the chunks ordering

mm: remove special swap entry functions

2021-07-01T18:06:03+00:00

Patch series "Add support for SVM atomics in Nouveau", v11.

Introduction
============

Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory.  To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU.  This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory

Implementation
==============

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifiers, which allow a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.

Patches
=======

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce new functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
=======

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes.  For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.

This patch (of 10):

Remove multiple similar inline functions for dealing with different types
of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn.  Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results in
shorter code that is easier to understand.
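
The idea of one helper for all pfn-carrying entries can be sketched in
userspace.  The bit layout, the type values, the toy mem_map[] and every
function name below are assumptions made for illustration; they do not
reproduce the kernel's swp_entry_t encoding.

```c
#include <stdint.h>

/* Hypothetical layout: the top 5 bits hold the type, the rest the offset. */
#define TYPE_SHIFT	(64 - 5)
#define OFFSET_MASK	((UINT64_C(1) << TYPE_SHIFT) - 1)

enum entry_type { ENTRY_MIGRATION = 1, ENTRY_DEVICE_PRIVATE = 2 };

typedef struct { uint64_t val; } entry_t;
struct page { int dummy; };

/* Toy memory map: a pfn indexes directly into this array (no bounds check). */
static struct page mem_map[16];

static entry_t make_entry(enum entry_type type, uint64_t pfn)
{
	entry_t e = { ((uint64_t)type << TYPE_SHIFT) | (pfn & OFFSET_MASK) };

	return e;
}

static int entry_type_of(entry_t e)
{
	return (int)(e.val >> TYPE_SHIFT);
}

/* For pfn-carrying entries the offset field is the pfn itself. */
static uint64_t entry_offset(entry_t e)
{
	return e.val & OFFSET_MASK;
}

/* One helper serves every entry type that stores a pfn in its offset. */
static struct page *pfn_entry_to_page(entry_t e)
{
	return &mem_map[entry_offset(e)];
}
```

Because both entry types keep the pfn in the same field, the lookup does
not need to know which type it was handed; only callers that care about
the type inspect it.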

Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Hugh Dickins 
Cc: Peter Xu 
Cc: Shakeel Butt 
Cc: Ben Skeggs 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection

2021-07-01T18:06:02+00:00

make W=1 generates the following warning for mem_cgroup_calculate_protection:

  mm/memcontrol.c:6468: warning: expecting prototype for mem_cgroup_protected(). Prototype was for mem_cgroup_calculate_protection() instead

Commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from
protection checks") changed the function definition but not the associated
kerneldoc comment.

Link: https://lkml.kernel.org/r/20210520084809.8576-7-mgorman@techsingularity.net
Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
Signed-off-by: Mel Gorman 
Reviewed-by: Yang Shi 
Acked-by: Chris Down 
Acked-by: Vlastimil Babka 
Cc: Dan Streetman 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

loop: charge i/o to mem and blk cg

2021-06-29T17:53:50+00:00

The current code only associates with the existing blkcg when aio is used
to access the backing file.  This patch covers all types of i/o to the
backing file and also associates the memcg so if the backing file is on
tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css and int_active_memcg so they can
be used by the loop module.

Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Jens Axboe 
Cc: Chris Down 
Cc: Michal Hocko 
Cc: Ming Lei 
Cc: Shakeel Butt 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: charge active memcg when no mm is set

2021-06-29T17:53:50+00:00

set_active_memcg() worked for kernel allocations but was silently ignored
for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (e.g. gup).

3. Otherwise consult the current process context. If there is an
   active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it would
always charge the root cgroup.  Now it looks up the active_memcg first
(falling back to charging the root cgroup if not set).
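
The precedence order can be written down as a small selection function.
This is a userspace sketch under assumptions: the struct definitions and
pick_memcg() are hypothetical and model only the ordering described above,
not the reference counting or RCU rules of the real charge path.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct memcg { const char *name; };
struct mm { struct memcg *memcg; };
struct page { struct memcg *memcg; };
struct task { struct memcg *active_memcg; struct mm *mm; };

static struct memcg root_memcg = { "root" };

/* Pick the memcg to charge, following the precedence order above. */
static struct memcg *pick_memcg(const struct page *page, const struct mm *mm,
				const struct task *task)
{
	/* 1. A page that already carries a memcg (e.g. swapin) keeps it. */
	if (page && page->memcg)
		return page->memcg;

	/* 2. An explicitly passed mm wins over the calling context. */
	if (mm)
		return mm->memcg ? mm->memcg : &root_memcg;

	/* 3. No mm: prefer the active memcg, then the task's own mm. */
	if (task->active_memcg)
		return task->active_memcg;
	if (task->mm && task->mm->memcg)
		return task->mm->memcg;

	/* Nothing set at all: fall back to the root cgroup. */
	return &root_memcg;
}
```

A kernel thread with no mm of its own that has set an active memcg is
charged to that memcg in this model, where the old behaviour would have
charged the root cgroup.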

Link: https://lkml.kernel.org/r/20210610173944.1203706-3-schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Tejun Heo 
Acked-by: Chris Down 
Acked-by: Jens Axboe 
Reviewed-by: Shakeel Butt 
Reviewed-by: Michal Koutný 
Cc: Michal Hocko 
Cc: Ming Lei 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock

2021-06-29T17:53:50+00:00

The css_set_lock is used to guard the list of inherited objcgs.  So there
is no need to uncharge kernel memory under css_set_lock.  Just move it out
of the lock.

Link: https://lkml.kernel.org/r/20210417043538.9793-8-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Acked-by: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Xiongchun Duan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: memcontrol: simplify the logic of objcg pinning memcg

2021-06-29T17:53:50+00:00

The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by the
css_set_lock.  We do not need to care about objcg->memcg being released in
the process of obj_cgroup_release().  So there is no need to pin memcg
before releasing objcg.  Remove that pinning logic to simplify the code.

There are only two places that modify objcg->memcg.  One is the
initialization of objcg->memcg in memcg_online_kmem(), the other is
objcg reparenting in memcg_reparent_objcgs().  It is also impossible
for the two to run in parallel.  So xchg() is unnecessary and it is enough
to use WRITE_ONCE().
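
The distinction can be illustrated in userspace with C11 atomics.  This is
an approximation under assumptions: a relaxed atomic store stands in for
WRITE_ONCE(), atomic_exchange() stands in for xchg(), and the structures
and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct memcg { int id; };
struct objcg { _Atomic(struct memcg *) memcg; };

/* Before: an atomic read-modify-write whose returned old value is unused. */
static void set_memcg_xchg(struct objcg *o, struct memcg *m)
{
	(void)atomic_exchange(&o->memcg, m);
}

/* After: a single untorn store, enough when writers cannot race. */
static void set_memcg_store(struct objcg *o, struct memcg *m)
{
	atomic_store_explicit(&o->memcg, m, memory_order_relaxed);
}

/* Reader side of the model: a single untorn load. */
static struct memcg *get_memcg(struct objcg *o)
{
	return atomic_load_explicit(&o->memcg, memory_order_relaxed);
}
```

Both setters leave the same pointer behind.  The exchange additionally
returns the previous value and orders surrounding accesses, and neither
property is needed when the two writers can never run concurrently.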

Link: https://lkml.kernel.org/r/20210417043538.9793-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Acked-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Xiongchun Duan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds