linux.git/mm/mincore.c, branch v5.8

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

2020-06-09T16:39:14+00:00

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)
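
To illustrate what a converted call site looks like, here is a minimal
userspace sketch.  The wrapper names come from the rule above; everything
else is an assumption made for illustration: the toy, single-threaded
struct rw_semaphore, the cut-down struct mm_struct, and the
read_map_count() caller.  It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for the kernel's rw_semaphore (assumption). */
struct rw_semaphore {
	int readers;	/* number of read holders */
	bool writer;	/* true while write-held */
};

/* Hypothetical cut-down mm_struct: only the lock and one protected field. */
struct mm_struct {
	struct rw_semaphore mmap_sem;
	int map_count;
};

static inline void mmap_init_lock(struct mm_struct *mm)
{
	mm->mmap_sem.readers = 0;
	mm->mmap_sem.writer = false;
}

static inline bool mmap_read_trylock(struct mm_struct *mm)
{
	if (mm->mmap_sem.writer)
		return false;
	mm->mmap_sem.readers++;
	return true;
}

static inline void mmap_read_lock(struct mm_struct *mm)
{
	/* A real rwsem would sleep here; the toy model insists it is free. */
	bool ok = mmap_read_trylock(mm);

	assert(ok);
	(void)ok;
}

static inline void mmap_read_unlock(struct mm_struct *mm)
{
	assert(mm->mmap_sem.readers > 0);
	mm->mmap_sem.readers--;
}

static inline bool mmap_write_trylock(struct mm_struct *mm)
{
	if (mm->mmap_sem.writer || mm->mmap_sem.readers > 0)
		return false;
	mm->mmap_sem.writer = true;
	return true;
}

static inline void mmap_write_unlock(struct mm_struct *mm)
{
	assert(mm->mmap_sem.writer);
	mm->mmap_sem.writer = false;
}

/* Converted caller: formerly down_read(&mm->mmap_sem) / up_read(&mm->mmap_sem). */
static int read_map_count(struct mm_struct *mm)
{
	int n;

	mmap_read_lock(mm);
	n = mm->map_count;
	mmap_read_unlock(mm);
	return n;
}
```

The caller now passes the mm itself and no longer names the lock field,
which is the point of the conversion.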

Signed-off-by: Michel Lespinasse 
Signed-off-by: Andrew Morton 
Reviewed-by: Daniel Jordan 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
Cc: Davidlohr Bueso 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Jason Gunthorpe 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Liam Howlett 
Cc: Matthew Wilcox 
Cc: Peter Zijlstra 
Cc: Ying Han 
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds

mm: reorder includes after introduction of linux/pgtable.h

2020-06-09T16:39:13+00:00

The replacement of <asm/pgtable.h> with <linux/pgtable.h> left the include
of the latter in the middle of asm includes.  Fix this up with the aid of
the script below and manual adjustments here and there.

	import sys
	import re

	if len(sys.argv) != 3:
	    print "USAGE: %s  " % (sys.argv[0])
	    sys.exit(1)

	hdr_to_move="#include " % sys.argv[2]
	moved = False
	in_hdrs = False

	with open(sys.argv[1], "r") as f:
	    lines = f.readlines()
	    for _line in lines:
		line = _line.rstrip('\n')
		if line == hdr_to_move:
		    continue
		if line.startswith("#include 
Signed-off-by: Andrew Morton 
Cc: Arnd Bergmann 
Cc: Borislav Petkov 
Cc: Brian Cain 
Cc: Catalin Marinas 
Cc: Chris Zankel 
Cc: "David S. Miller" 
Cc: Geert Uytterhoeven 
Cc: Greentime Hu 
Cc: Greg Ungerer 
Cc: Guan Xuetao 
Cc: Guo Ren 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: Ingo Molnar 
Cc: Ley Foon Tan 
Cc: Mark Salter 
Cc: Matthew Wilcox 
Cc: Matt Turner 
Cc: Max Filippov 
Cc: Michael Ellerman 
Cc: Michal Simek 
Cc: Nick Hu 
Cc: Paul Walmsley 
Cc: Richard Weinberger 
Cc: Rich Felker 
Cc: Russell King 
Cc: Stafford Horne 
Cc: Thomas Bogendoerfer 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vincent Chen 
Cc: Vineet Gupta 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds

mm: introduce include/linux/pgtable.h

2020-06-09T16:39:13+00:00

The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Signed-off-by: Mike Rapoport 
Signed-off-by: Andrew Morton 
Cc: Arnd Bergmann 
Cc: Borislav Petkov 
Cc: Brian Cain 
Cc: Catalin Marinas 
Cc: Chris Zankel 
Cc: "David S. Miller" 
Cc: Geert Uytterhoeven 
Cc: Greentime Hu 
Cc: Greg Ungerer 
Cc: Guan Xuetao 
Cc: Guo Ren 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: Ingo Molnar 
Cc: Ley Foon Tan 
Cc: Mark Salter 
Cc: Matthew Wilcox 
Cc: Matt Turner 
Cc: Max Filippov 
Cc: Michael Ellerman 
Cc: Michal Simek 
Cc: Nick Hu 
Cc: Paul Walmsley 
Cc: Richard Weinberger 
Cc: Rich Felker 
Cc: Russell King 
Cc: Stafford Horne 
Cc: Thomas Bogendoerfer 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vincent Chen 
Cc: Vineet Gupta 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds

mm: pagewalk: add 'depth' parameter to pte_hole

2020-02-04T03:05:25+00:00

The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know at what depth the
missing entry is.  Add this as an extra parameter to pte_hole().  When the
depth isn't known (e.g. when processing a vma), -1 is passed.

The depth that is reported is the actual level where the entry is missing
(ignoring any folding that is in place), i.e.  any levels where
PTRS_PER_P?D is set to 1 are ignored.

Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.
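
As a small illustration of the numbering, a callback could translate the
depth it receives into a level name.  The helper below is hypothetical and
not part of the patch; placing P4D at depth 1 is inferred from the
statement that PGD is 0 and PUD is 2.

```c
#include <stddef.h>

/* Hypothetical helper: names the page-table level for a pte_hole() depth. */
static const char *pte_hole_depth_name(int depth)
{
	/* Index is the depth: PGD = 0, P4D = 1 (inferred), PUD = 2, PMD = 3, PTE = 4. */
	static const char *const names[] = { "PGD", "P4D", "PUD", "PMD", "PTE" };

	if (depth < 0 || depth >= (int)(sizeof(names) / sizeof(names[0])))
		return "unknown";	/* covers -1, passed when the depth isn't known */
	return names[depth];
}
```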

Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
Signed-off-by: Steven Price 
Tested-by: Zong Li 
Cc: Albert Ou 
Cc: Alexandre Ghiti 
Cc: Andy Lutomirski 
Cc: Ard Biesheuvel 
Cc: Arnd Bergmann 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Christian Borntraeger 
Cc: Dave Hansen 
Cc: David S. Miller 
Cc: Heiko Carstens 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: James Hogan 
Cc: James Morse 
Cc: Jerome Glisse 
Cc: "Liang, Kan" 
Cc: Mark Rutland 
Cc: Michael Ellerman 
Cc: Paul Burton 
Cc: Paul Mackerras 
Cc: Paul Walmsley 
Cc: Peter Zijlstra 
Cc: Ralf Baechle 
Cc: Russell King 
Cc: Thomas Gleixner 
Cc: Vasily Gorbik 
Cc: Vineet Gupta 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: untag user pointers passed to memory syscalls

2019-09-26T00:51:41+00:00

This patch is part of a series that extends the kernel ABI to allow
tagged user pointers (with the top byte set to something other than
0x00) to be passed as syscall arguments.

This patch allows tagged pointers to be passed to the following memory
syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
mremap, msync, munlock, move_pages.

The mmap and mremap syscalls do not currently accept tagged addresses.
Architectures may interpret the tag as a background colour for the
corresponding vma.
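
A rough userspace sketch of what "untagging" means for a user address
follows.  The names set_tag() and untag_user_addr() are hypothetical, and
the tag position (bits 63:56) is an assumption matching the "top byte"
wording above.  As far as I know, the kernel's own helper is
untagged_addr(), which on arm64 sign-extends from bit 55 rather than
masking, so that kernel addresses are left intact; the mask below is only
adequate for user addresses.

```c
#include <stdint.h>

/* Assumption: the tag occupies the top byte (bits 63:56) of the pointer. */
#define TAG_SHIFT	56
#define TAG_MASK	(UINT64_C(0xff) << TAG_SHIFT)

/* Hypothetical helper: place a tag in the top byte of a user address. */
static inline uint64_t set_tag(uint64_t addr, uint8_t tag)
{
	return (addr & ~TAG_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

/* Illustrative only: clear the tag byte before the address is used for lookups. */
static inline uint64_t untag_user_addr(uint64_t addr)
{
	return addr & ~TAG_MASK;
}
```

A syscall such as mincore would apply the untagging step once to its start
address and then proceed with the plain address as before.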

Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov 
Reviewed-by: Khalid Aziz 
Reviewed-by: Vincenzo Frascino 
Reviewed-by: Catalin Marinas 
Reviewed-by: Kees Cook 
Cc: Al Viro 
Cc: Dave Hansen 
Cc: Eric Auger 
Cc: Felix Kuehling 
Cc: Jens Wiklander 
Cc: Mauro Carvalho Chehab 
Cc: Mike Rapoport 
Cc: Will Deacon 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

pagewalk: separate function pointers from iterator data

2019-09-07T07:28:04+00:00

The mm_walk structure currently mixes data and code.  Split out the
operations vectors into a new mm_walk_ops structure, and while we are
changing the API also declare the mm_walk structure inside the
walk_page_range and walk_page_vma functions.
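
The shape of the split can be sketched in plain C as below.  This is a
heavily simplified model under stated assumptions: the walker takes no mm
argument, only a pte_entry callback exists, SKETCH_PAGE_SIZE is made up,
and count_page()/count_ops are hypothetical users.  It is not the kernel
code.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption for the sketch */

struct mm_walk;

/* Operations vector: holds only function pointers, so it can be const and shared. */
struct mm_walk_ops {
	int (*pte_entry)(unsigned long addr, struct mm_walk *walk);
};

/* Per-walk state: holds only data and is built inside the walker. */
struct mm_walk {
	const struct mm_walk_ops *ops;
	void *private;
};

/* Simplified walker: calls pte_entry once per page in [start, end). */
static int walk_page_range(unsigned long start, unsigned long end,
			   const struct mm_walk_ops *ops, void *private)
{
	struct mm_walk walk = { .ops = ops, .private = private };
	unsigned long addr;
	int err;

	for (addr = start; addr < end; addr += SKETCH_PAGE_SIZE) {
		err = walk.ops->pte_entry(addr, &walk);
		if (err)
			return err;	/* a non-zero return stops the walk */
	}
	return 0;
}

/* Hypothetical callback: counts visited pages via the private pointer. */
static int count_page(unsigned long addr, struct mm_walk *walk)
{
	(void)addr;
	++*(unsigned long *)walk->private;
	return 0;
}

static const struct mm_walk_ops count_ops = { .pte_entry = count_page };
```

Callers hand over a const ops table plus a private pointer, and never
build a struct mm_walk themselves.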

Based on patch from Linus Torvalds.

Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
Signed-off-by: Christoph Hellwig 
Reviewed-by: Thomas Hellstrom 
Reviewed-by: Steven Price 
Reviewed-by: Jason Gunthorpe 
Signed-off-by: Jason Gunthorpe

mm: split out a new pagewalk.h header from mm.h

2019-09-07T07:28:04+00:00

Add a new header for the two handfuls of users of the walk_page_range /
walk_page_vma interface instead of polluting all users of mm.h with it.

Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
Signed-off-by: Christoph Hellwig 
Reviewed-by: Thomas Hellstrom 
Reviewed-by: Steven Price 
Reviewed-by: Jason Gunthorpe 
Signed-off-by: Jason Gunthorpe

mm/mincore.c: fix race between swapoff and mincore

2019-07-12T18:05:43+00:00

Since commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"),
the address_space associated with a swap device is freed after swapoff.
So swap_address_space() users that touch the address_space need some
mechanism to prevent the address_space from being freed while it is
being accessed.

When mincore processes an unmapped range for swapped shmem pages, it
doesn't hold a lock that prevents the swap device from being swapped
off.  So the following race is possible:

CPU1					CPU2
do_mincore()				swapoff()
  walk_page_range()
    mincore_unmapped_range()
      __mincore_unmapped_range
        mincore_page
	  as = swap_address_space()
          ...				  exit_swap_address_space()
          ...				    kvfree(spaces)
	  find_get_page(as)

The address space may be accessed after being freed.

To fix the race, enclose find_get_page() in
get_swap_device()/put_swap_device(), which checks whether the swap entry
is valid and prevents the swap device from being swapped off while it is
being accessed.
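
The get/put bracket can be modelled in a few lines of single-threaded C.
Every name below is hypothetical (prefixed toy_), and the plain user count
is an assumption that stands in for whatever synchronisation the kernel
really uses.  The sketch only shows the ordering guarantee: the lookup
happens strictly between get and put, and swapoff cannot free the address
spaces while a reference is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (assumption): one swap device guarded by a plain user count. */
struct toy_swap_device {
	bool valid;		/* cleared once swapoff starts */
	int users;		/* lookups currently in flight */
	bool spaces_freed;	/* stands in for kvfree(spaces) */
};

/* Take a reference, or return NULL if the device is already going away. */
static struct toy_swap_device *toy_get_swap_device(struct toy_swap_device *si)
{
	if (!si->valid)
		return NULL;
	si->users++;
	return si;
}

static void toy_put_swap_device(struct toy_swap_device *si)
{
	assert(si->users > 0);
	si->users--;
}

/* Returns false while users remain; a real swapoff would wait instead. */
static bool toy_try_swapoff(struct toy_swap_device *si)
{
	si->valid = false;	/* refuse new users first */
	if (si->users > 0)
		return false;
	si->spaces_freed = true;
	return true;
}

/* Lookup in the style of mincore_page(): touch the spaces only between get and put. */
static int toy_lookup_page(struct toy_swap_device *si)
{
	if (!toy_get_swap_device(si))
		return 0;		/* stale entry: report "not present" */
	assert(!si->spaces_freed);	/* swapoff cannot finish while we hold a reference */
	/* find_get_page(as) would go here */
	toy_put_swap_device(si);
	return 1;
}
```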

Link: http://lkml.kernel.org/r/20190611020510.28251-1-ying.huang@intel.com
Fixes: 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks")
Signed-off-by: "Huang, Ying" 
Reviewed-by: Andrew Morton 
Acked-by: Michal Hocko 
Cc: Hugh Dickins 
Cc: Paul E. McKenney 
Cc: Minchan Kim 
Cc: Johannes Weiner 
Cc: Tim Chen 
Cc: Mel Gorman 
Cc: Jérôme Glisse 
Cc: Andrea Arcangeli 
Cc: Yang Shi 
Cc: David Rientjes 
Cc: Rik van Riel 
Cc: Jan Kara 
Cc: Dave Jiang 
Cc: Daniel Jordan 
Cc: Andrea Parri 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/mincore.c: make mincore() more conservative

2019-05-15T02:52:48+00:00

The semantics of what mincore() considers to be resident are not
completely clear, but Linux has always (since 2.3.52, which is when
mincore() was initially done) treated it as "page is available in page
cache".

That's potentially a problem, as that [in]directly exposes
meta-information about pagecache / memory mapping state even about
memory not strictly belonging to the process executing the syscall,
opening possibilities for sidechannel attacks.

Change the semantics of mincore() so that it only reveals pagecache
information for non-anonymous mappings that belong to files that the
calling process could (if it tried to) successfully open for writing;
otherwise we'd be including shared non-exclusive mappings, which

 - is the sidechannel

 - is not the usecase for mincore(), as that's primarily used for data,
   not (shared) text
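
For reference, a minimal userspace caller of mincore() on an anonymous
mapping is sketched below.  The restriction described above concerns
non-anonymous mappings, so this case should behave the same before and
after the change.  The function name is hypothetical, and the sketch
assumes a Linux/glibc environment where mincore() and MAP_ANONYMOUS are
available.

```c
#define _DEFAULT_SOURCE		/* for mincore() and MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if the first page of a fresh anonymous mapping is reported
 * resident after being written to, 0 if not, and -1 on error.
 */
static int anon_page_resident_after_touch(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char vec = 0;
	int resident;
	char *p;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = 1;			/* fault the page in */
	if (mincore(p, (size_t)page, &vec) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}
	resident = vec & 1;		/* least significant bit: page is resident */
	munmap(p, (size_t)page);
	return resident;
}
```

On an otherwise idle system the freshly touched page is normally reported
as resident.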

[jkosina@suse.cz: v2]
  Link: http://lkml.kernel.org/r/20190312141708.6652-2-vbabka@suse.cz
[mhocko@suse.com: restructure can_do_mincore() conditions]
Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1903062342020.19912@cbobk.fhfr.pm
Signed-off-by: Jiri Kosina 
Signed-off-by: Vlastimil Babka 
Acked-by: Josh Snyder 
Acked-by: Michal Hocko 
Originally-by: Linus Torvalds 
Originally-by: Dominique Martinet 
Cc: Andy Lutomirski 
Cc: Dave Chinner 
Cc: Kevin Easton 
Cc: Matthew Wilcox 
Cc: Cyril Hrubis 
Cc: Tejun Heo 
Cc: Kirill A. Shutemov 
Cc: Daniel Gruss 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Revert "Change mincore() to count "mapped" pages rather than "cached" pages"

2019-01-23T20:04:37+00:00

This reverts commit 574823bfab82d9d8fa47f422778043fbb4b4f50e.

It turns out that my hope that we could just remove the code that
exposes the cache residency status from mincore() was too optimistic.

There are various random users that want it, and one example would be
the Netflix database cluster maintenance. To quote Josh Snyder:

 "For Netflix, losing accurate information from the mincore syscall
  would lengthen database cluster maintenance operations from days to
  months. We rely on cross-process mincore to migrate the contents of a
  page cache from machine to machine, and across reboots.

  To do this, I wrote and maintain happycache [1], a page cache
  dumper/loader tool. It is quite similar in architecture to pgfincore,
  except that it is agnostic to workload. The gist of happycache's
  operation is "produce a dump of residence status for each page, do
  some operation, then reload exactly the same pages which were present
  before." happycache is entirely dependent on accurate reporting of the
  in-core status of file-backed pages, as accessed by another process.

  We primarily use happycache with Cassandra, which (like Postgres +
  pgfincore) relies heavily on OS page cache to reduce disk accesses.
  Because our workloads never experience a cold page cache, we are able
  to provision hardware for a peak utilization level that is far lower
  than the hypothetical "every query is a cache miss" peak.

  A database warmed by happycache can be ready for service in seconds
  (bounded only by the performance of the drives and the I/O subsystem),
  with no period of in-service degradation. By contrast, putting a
  database in service without a page cache entails a potentially
  unbounded period of degradation (at Netflix, the time to populate a
  single node's cache via natural cache misses varies by workload from
  hours to weeks). If a single node upgrade were to take weeks, then
  upgrading an entire cluster would take months. Since we want to apply
  security upgrades (and other things) on a somewhat tighter schedule,
  we would have to develop more complex solutions to provide the same
  functionality already provided by mincore.

  At the bottom line, happycache is designed to benignly exploit the
  same information leak documented in the paper [2]. I think it makes
  perfect sense to remove cross-process mincore functionality from
  unprivileged users, but not to remove it entirely"

We do have an alternate approach that limits the cache residency
reporting only to processes that have write permissions to the file, so
we can fix the original information leak issue that way.  It involves
_adding_ code rather than removing it, which is sad, but hey, at least
we haven't found any users that would find the restrictions
unacceptable.

So revert the optimistic first approach to make room for that alternate
fix instead.

Reported-by: Josh Snyder 
Cc: Jiri Kosina 
Cc: Dominique Martinet 
Cc: Andy Lutomirski 
Cc: Dave Chinner 
Cc: Kevin Easton 
Cc: Matthew Wilcox 
Cc: Cyril Hrubis 
Cc: Vlastimil Babka 
Cc: Tejun Heo 
Cc: Kirill A. Shutemov 
Cc: Daniel Gruss 
Signed-off-by: Linus Torvalds