linux.git/mm/util.c, branch v4.17

mm/gup.c: document return value

2018-04-14T00:10:27+00:00

__get_user_pages_fast handles errors differently from
get_user_pages_fast: the former always returns the number of pages
pinned, the later might return a negative error code.

Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
Signed-off-by: Michael S. Tsirkin 
Reviewed-by: Andrew Morton 
Cc: Kirill A. Shutemov 
Cc: Huang Ying 
Cc: Jonathan Corbet 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Thorsten Leemhuis 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

exec: pass stack rlimit into mm layout functions

2018-04-11T17:28:37+00:00

Patch series "exec: Pin stack limit during exec".

Attempts to solve problems with the stack limit changing during exec
continue to be frustrated[1][2].  In addition to the specific issues
around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
other places during exec where the stack limit is used and is assumed to
be unchanging.  Given the many places it gets used and the fact that it
can be manipulated/raced via setrlimit() and prlimit(), I think the only
way to handle this is to move away from the "current" view of the stack
limit and instead attach it to the bprm, and plumb this down into the
functions that need to know the stack limits.  This series implements
the approach.

[1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
[2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
[3] to security@kernel.org, "Subject: existing rlimit races?"

This patch (of 3):

Since it is possible that the stack rlimit can change externally during
exec (either via another thread calling setrlimit() or another process
calling prlimit()), provide a way to pass the rlimit down into the
per-architecture mm layout functions so that the rlimit can stay in the
bprm structure instead of sitting in the signal structure until exec is
finalized.

Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook 
Cc: Michal Hocko 
Cc: Ben Hutchings 
Cc: Willy Tarreau 
Cc: Hugh Dickins 
Cc: Oleg Nesterov 
Cc: "Jason A. Donenfeld" 
Cc: Rik van Riel 
Cc: Laura Abbott 
Cc: Greg KH 
Cc: Andy Lutomirski 
Cc: Ben Hutchings 
Cc: Brad Spengler 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: treat indirectly reclaimable memory as free in overcommit logic

2018-04-11T17:28:29+00:00

Indirectly reclaimable memory can consume a significant part of total
memory and it's actually reclaimable (it will be released under actual
memory pressure).

So, the overcommit logic should treat it as free.

Otherwise, it's possible to cause random system-wide memory allocation
failures by consuming a significant amount of memory by indirectly
reclaimable memory, e.g.  dentry external names.

If overcommit policy GUESS is used, it might be used for denial of
service attack under some conditions.

The following program illustrates the approach.  It causes the kernel to
allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
that at some point the overcommit logic may start blocking large
allocation system-wide.

  int main()
  {
  	char buf[256];
  	unsigned long i;
  	struct stat statbuf;

  	buf[0] = '/';
  	for (i = 1; i < sizeof(buf); i++)
  		buf[i] = '_';

  	for (i = 0; 1; i++) {
  		sprintf(&buf[248], "%8lu", i);
  		stat(buf, &statbuf);
  	}

  	return 0;
  }

This patch in combination with related indirectly reclaimable memory
patches closes this issue.

Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
Signed-off-by: Roman Gushchin 
Reviewed-by: Andrew Morton 
Cc: Alexander Viro 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix races between swapoff and flush dcache

2018-04-06T04:36:26+00:00

Thanks to commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks"), after swapoff the address_space associated with the swap
device will be freed.  So page_mapping() users which may touch the
address_space need some kind of mechanism to prevent the address_space
from being freed during accessing.

The dcache flushing functions (flush_dcache_page(), etc) in architecture
specific code may access the address_space of swap device for anonymous
pages in swap cache via page_mapping() function.  But in some cases
there are no mechanisms to prevent the swap device from being swapoff,
for example,

  CPU1					CPU2
  __get_user_pages()			swapoff()
    flush_dcache_page()
      mapping = page_mapping()
        ...				  exit_swap_address_space()
        ...				    kvfree(spaces)
        mapping_mapped(mapping)

The address space may be accessed after being freed.

But from cachetlb.txt and Russell King, flush_dcache_page() only care
about file cache pages, for anonymous pages, flush_anon_page() should be
used.  The implementation of flush_dcache_page() in all architectures
follows this too.  They will check whether page_mapping() is NULL and
whether mapping_mapped() is true to determine whether to flush the
dcache immediately.  And they will use interval tree (mapping->i_mmap)
to find all user space mappings.  While mapping_mapped() and
mapping->i_mmap isn't used by anonymous pages in swap cache at all.

So, to fix the race between swapoff and flush dcache, __page_mapping()
is add to return the address_space for file cache pages and NULL
otherwise.  All page_mapping() invoking in flush dcache functions are
replaced with page_mapping_file().

[akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" 
Reviewed-by: Andrew Morton 
Cc: Minchan Kim 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Dave Hansen 
Cc: Chen Liqin 
Cc: Russell King 
Cc: Yoshinori Sato 
Cc: "James E.J. Bottomley" 
Cc: Guan Xuetao 
Cc: "David S. Miller" 
Cc: Chris Zankel 
Cc: Vineet Gupta 
Cc: Ley Foon Tan 
Cc: Ralf Baechle 
Cc: Andi Kleen 
Cc: Mike Rapoport 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

new primitive: vmemdup_user()

2018-01-07T18:06:15+00:00

similar to memdup_user(), but does *not* guarantee that result will
be physically contiguous; use only in cases where that's not a requirement
and free it with kvfree().

Signed-off-by: Al Viro

memdup_user(): switch to GFP_USER

2018-01-07T18:00:27+00:00

Signed-off-by: Al Viro

mm: rename global_page_state to global_zone_page_state

2017-09-07T00:27:29+00:00

global_page_state is error prone as a recent bug report pointed out [1].
It only returns proper values for zone based counters as the enum it
gets suggests.  We already have global_node_page_state so let's rename
global_page_state to global_zone_page_state to be more explicit here.
All existing users seems to be correct:

$ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
      2 NR_BOUNCE
      2 NR_FREE_CMA_PAGES
     11 NR_FREE_PAGES
      1 NR_KERNEL_STACK_KB
      1 NR_MLOCK
      2 NR_PAGETABLE

This patch shouldn't introduce any functional change.

[1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
Signed-off-by: Michal Hocko 
Signed-off-by: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: Josef Bacik 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix global NR_SLAB_.*CLAIMABLE counter reads

2017-08-10T22:54:06+00:00

As Tetsuo points out:
 "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to
  node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
  0kB"

In addition to /proc/meminfo, this problem also affects the slab
counters OOM/allocation failure info dumps, can cause early -ENOMEM from
overcommit protection, and miscalculate image size requirements during
suspend-to-disk.

This is because the patch in question switched the slab counters from
the zone level to the node level, but forgot to update the global
accessor functions to read the aggregate node data instead of the
aggregate zone data.

Use global_node_page_state() to access the global slab counters.

Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner 
Reported-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Cc: Josef Bacik 
Cc: Vladimir Davydov 
Cc: Stefan Agner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2017-07-15T19:00:42+00:00

Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data

mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes

2017-07-12T23:26:03+00:00

Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
request size we can drop the hackish implementation for !costly orders.
__GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
progress and backs of when we are out of memory for the requested size.
Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
to silent the oom killer anymore.

Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Cc: Alex Belits 
Cc: Chris Wilson 
Cc: Christoph Hellwig 
Cc: Darrick J. Wong 
Cc: David Daney 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: NeilBrown 
Cc: Ralf Baechle 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds