linux.git/fs/proc/task_mmu.c, branch v5.12

mm: use is_cow_mapping() across tree where proper

2021-03-13T19:27:30+00:00

Now that is_cow_mapping() is exported in mm.h, replace the open-coded
checks elsewhere in the tree with the new helper.
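
For reference, a standalone sketch of the check the helper performs. The
flag values and the typedef are copied in as stand-ins so the snippet
builds outside the kernel tree; in the kernel they come from the mm
headers.

```c
#include <stdbool.h>

/* Stand-ins so the sketch builds in userspace; not the kernel headers. */
typedef unsigned long vm_flags_t;
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/*
 * A mapping is copy-on-write when it is private (VM_SHARED clear) but
 * may be made writable (VM_MAYWRITE set).
 */
static inline bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```

An open-coded test of the form
(flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE is the kind of check
this change replaces with a call to the helper.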

Link: https://lkml.kernel.org/r/20210217233547.93892-5-peterx@redhat.com
Signed-off-by: Peter Xu 
Reviewed-by: Jason Gunthorpe 
Cc: VMware Graphics 
Cc: Roland Scheidegger 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Mike Kravetz 
Cc: Alexey Dobriyan 
Cc: Andrea Arcangeli 
Cc: Christoph Hellwig 
Cc: David Gibson 
Cc: Gal Pressman 
Cc: Jan Kara 
Cc: Jann Horn 
Cc: Kirill Shutemov 
Cc: Kirill Tkhai 
Cc: Matthew Wilcox 
Cc: Miaohe Lin 
Cc: Mike Rapoport 
Cc: Wei Zhang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: proc: Invalidate TLB after clearing soft-dirty page state

2021-01-29T19:02:28+00:00

Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double
flush"), TLB invalidation is elided in tlb_finish_mmu() if no entries
were batched via the tlb_remove_*() functions. Consequently, the
page-table modifications performed by clear_refs_write() in response to
a write to /proc/<pid>/clear_refs do not perform TLB invalidation.
Although this is fine when simply aging the ptes, in the case of
clearing the "soft-dirty" state we can end up with entries where
pte_write() is false, yet a writable mapping remains in the TLB.

Fix this by avoiding the mmu_gather API altogether and instead managing
both the 'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the soft-dirty path, much like mprotect() does already.
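
The ordering can be illustrated with a small userspace model. This is a
sketch, not kernel code: struct mm_model and every model_* function are
hypothetical stand-ins for mm_struct, inc_tlb_flush_pending(),
flush_tlb_mm() and dec_tlb_flush_pending(), and the model tracks a single
pte with one cached TLB entry.

```c
#include <stdbool.h>

/* Hypothetical model: one pte and the TLB entry that caches it. */
struct mm_model {
	int  tlb_flush_pending;	/* models mm->tlb_flush_pending */
	bool pte_write;		/* write permission in the page table */
	bool tlb_write;		/* write permission cached in the TLB */
};

static void model_inc_tlb_flush_pending(struct mm_model *mm)
{
	mm->tlb_flush_pending++;
}

static void model_dec_tlb_flush_pending(struct mm_model *mm)
{
	mm->tlb_flush_pending--;
}

/* Models flush_tlb_mm(): the stale entry is dropped and refilled from the pte. */
static void model_flush_tlb_mm(struct mm_model *mm)
{
	mm->tlb_write = mm->pte_write;
}

/* Models the soft-dirty path: write-protect the pte, then flush explicitly. */
static void model_clear_soft_dirty(struct mm_model *mm)
{
	model_inc_tlb_flush_pending(mm);
	mm->pte_write = false;		/* the TLB entry is stale from here on */
	model_flush_tlb_mm(mm);		/* explicit invalidation, no mmu_gather */
	model_dec_tlb_flush_pending(mm);
}
```

Without the explicit flush step the model ends with pte_write false and
tlb_write still true, which is the stale writable entry described above.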

Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush")
Signed-off-by: Will Deacon 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Yu Zhao 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Linus Torvalds 
Link: https://lkml.kernel.org/r/20210127235347.1402-2-will@kernel.org

mm: don't play games with pinned pages in clear_page_refs

2021-01-16T18:51:26+00:00

Turning a pinned page read-only breaks the pinning after COW.  Don't do it.

The whole "track page soft dirty" state doesn't work with pinned pages
anyway, since the page might be dirtied by the pinning entity without
ever being noticed in the page tables.

Signed-off-by: Linus Torvalds

mm: fix clear_refs_write locking

2021-01-16T18:46:39+00:00

Turning page table entries read-only requires the mmap_sem held for
writing.

So stop doing the odd games with turning things from read locks to write
locks and back.  Just get the write lock.

Signed-off-by: Linus Torvalds

proc: use untagged_addr() for pagemap_read addresses

2020-12-11T22:02:14+00:00

When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag.
To fix it, we should untag the userspace pointers in pagemap_read().

I tested with 5.10-rc4 and the issue remains.

Explanation from Catalin in [1]:

 "Arguably, that's a user-space bug since tagged file offsets were never
  supported. In this case it's not even a tag at bit 56 as per the arm64
  tagged address ABI but rather down to bit 47. You could say that the
  problem is caused by the C library (malloc()) or whoever created the
  tagged vaddr and passed it to this function. It's not a kernel
  regression as we've never supported it.

  Now, pagemap is a special case where the offset is usually not
  generated as a classic file offset but rather derived by shifting a
  user virtual address. I guess we can make a concession for pagemap
  (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"

My test code is based on [2]:

A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8

userspace program:

  uint64 OsLayer::VirtualToPhysical(void *vaddr) {
	uint64 frame, paddr, pfnmask, pagemask;
	int pagesize = sysconf(_SC_PAGESIZE);
	off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
	int fd = open(kPagemapPath, O_RDONLY);
	...

	if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
		int err = errno;
		string errtxt = ErrorString(err);
		if (fd >= 0)
			close(fd);
		return 0;
	}
  ...
  }

kernel fs/proc/task_mmu.c:

  static ssize_t pagemap_read(struct file *file, char __user *buf,
		size_t count, loff_t *ppos)
  {
	...
	src = *ppos;
	svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
	start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
	end_vaddr = mm->task_size;

	/* watch out for wraparound */
	// svpfn == 0xb400007662f54
	// (mm->task_size >> PAGE_SHIFT) == 0x8000000
	if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
		start_vaddr = end_vaddr;

	ret = 0;
	while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
		int len;
		unsigned long end;
		...
	}
	...
  }

[1] https://lore.kernel.org/patchwork/patch/1343258/
[2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158
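
The arithmetic above can be checked with a small userspace sketch. It
assumes 4 KiB pages, and model_untagged_addr() is a hypothetical model of
the arm64 behaviour (sign extension from bit 55), not the kernel macro;
pagemap_offset() and pagemap_start_vaddr() are illustrative names only.

```c
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define PM_ENTRY_BYTES	8	/* one 64-bit pagemap entry per page */

/*
 * Model of untagging on arm64: sign-extend from bit 55, which clears
 * the tag byte of a user address.
 */
static uint64_t model_untagged_addr(uint64_t addr)
{
	const uint64_t top_byte = 0xffULL << 56;

	return (addr & (1ULL << 55)) ? (addr | top_byte) : (addr & ~top_byte);
}

/* File offset a userspace tool derives from a virtual address. */
static uint64_t pagemap_offset(uint64_t vaddr)
{
	return (vaddr >> PAGE_SHIFT) * PM_ENTRY_BYTES;
}

/* Start address recovered from the file offset, with the tag stripped. */
static uint64_t pagemap_start_vaddr(uint64_t off)
{
	uint64_t svpfn = off / PM_ENTRY_BYTES;

	return model_untagged_addr(svpfn << PAGE_SHIFT);
}
```

For the tagged pointer 0xb400007662f541c8 the offset is 0x5a00003b317aa0,
and untagging the recovered address gives 0x7662f54000, which no longer
trips a comparison against the task size.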

Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen 
Reviewed-by: Vincenzo Frascino 
Reviewed-by: Catalin Marinas 
Cc: Alexey Dobriyan 
Cc: Andrey Konovalov 
Cc: Alexander Potapenko 
Cc: Vincenzo Frascino 
Cc: Andrey Ryabinin 
Cc: Catalin Marinas 
Cc: Dmitry Vyukov 
Cc: Marco Elver 
Cc: Will Deacon 
Cc: Eric W. Biederman 
Cc: Song Bao Hua (Barry Song) 
Cc: stable [5.4-]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove the now-unnecessary mmget_still_valid() hack

2020-10-16T18:11:22+00:00

The preceding patches have ensured that core dumping properly takes the
mmap_lock.  Thanks to that, we can now remove mmget_still_valid() and all
its users.

Signed-off-by: Jann Horn 
Signed-off-by: Andrew Morton 
Acked-by: Linus Torvalds 
Cc: Christoph Hellwig 
Cc: Alexander Viro 
Cc: "Eric W . Biederman" 
Cc: Oleg Nesterov 
Cc: Hugh Dickins 
Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
Signed-off-by: Linus Torvalds

mm: proc: smaps_rollup: do not stall write attempts on mmap_lock

2020-10-14T01:38:31+00:00

smaps_rollup grabs mmap_lock and holds it while walking the whole vma
list.  For large processes the lock is therefore held for a long time,
which may block writers such as mmap and munmap from making progress.

There are upcoming mmap_lock optimizations like range-based locks, but the
lock applied to smaps_rollup would be the coarse type, which doesn't avoid
the occurrence of unpleasant contention.

To solve the aforementioned issue, add a check that detects whether anyone
is waiting to take mmap_lock for writing.

Signed-off-by: Chinwen Chang 
Signed-off-by: Andrew Morton 
Cc: Steven Price 
Cc: Michel Lespinasse 
Cc: Matthias Brugger 
Cc: Vlastimil Babka 
Cc: Daniel Jordan 
Cc: Davidlohr Bueso 
Cc: Chinwen Chang 
Cc: Alexey Dobriyan 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Jason Gunthorpe 
Cc: Song Liu 
Cc: Jimmy Assarsson 
Cc: Huang Ying 
Cc: Daniel Kiss 
Cc: Laurent Dufour 
Link: http://lkml.kernel.org/r/1597715898-3854-4-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Linus Torvalds

mm: smaps*: extend smap_gather_stats to support specified beginning

2020-10-14T01:38:31+00:00

Extend smap_gather_stats to accept the address at which it should start
gathering.  To do so, add a new parameter @start, supplied by the caller,
and refactor the function for simplicity.

If @start is 0, it will use the range of @vma for gathering.
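
A minimal sketch of the range selection this describes. The struct
range_model type and model_gather_range() are hypothetical names used for
illustration, and rejecting a @start at or beyond the end of the vma is an
assumption of this sketch rather than something stated above.

```c
/* Hypothetical result type: the half-open range [start, end) to walk. */
struct range_model {
	unsigned long start;
	unsigned long end;
};

/*
 * Pick the range to gather for a vma spanning [vm_start, vm_end).
 * Returns 1 and fills @out when there is something to walk, else 0.
 */
static int model_gather_range(unsigned long vm_start, unsigned long vm_end,
			      unsigned long start, struct range_model *out)
{
	/* Assumption: a start at or past the end of the vma is invalid. */
	if (start >= vm_end)
		return 0;

	/* @start == 0 means "use the whole range of the vma". */
	out->start = start ? start : vm_start;
	out->end = vm_end;
	return 1;
}
```

A caller that resumes a partially completed walk would pass the address
where it stopped; every other caller passes 0.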

Signed-off-by: Chinwen Chang 
Signed-off-by: Andrew Morton 
Reviewed-by: Steven Price 
Cc: Michel Lespinasse 
Cc: Alexey Dobriyan 
Cc: Daniel Jordan 
Cc: Daniel Kiss 
Cc: Davidlohr Bueso 
Cc: Huang Ying 
Cc: Jason Gunthorpe 
Cc: Jimmy Assarsson 
Cc: Laurent Dufour 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Matthias Brugger 
Cc: Song Liu 
Cc: Vlastimil Babka 
Link: http://lkml.kernel.org/r/1597715898-3854-3-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Linus Torvalds

proc: optimise smaps for shmem entries

2020-10-14T01:38:29+00:00

Avoid bumping the refcount on pages when we're only interested in the
swap entries.

Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Andrew Morton 
Acked-by: Johannes Weiner 
Cc: Alexey Dobriyan 
Cc: Chris Wilson 
Cc: Huang Ying 
Cc: Hugh Dickins 
Cc: Jani Nikula 
Cc: Matthew Auld 
Cc: William Kucharski 
Link: https://lkml.kernel.org/r/20200910183318.20139-5-willy@infradead.org
Signed-off-by: Linus Torvalds

arm64: mte: Add PROT_MTE support to mmap() and mprotect()

2020-09-04T11:46:07+00:00

To enable tagging on a memory range, the user must explicitly opt in via
a new PROT_MTE flag passed to mmap() or mprotect(). Since this is a new
memory type in the AttrIndx field of a pte, simplify the or'ing of these
bits over the protection_map[] attributes by making MT_NORMAL index 0.

There are two conditions for arch_vm_get_page_prot() to return the
MT_NORMAL_TAGGED memory type: (1) the user requested it via PROT_MTE,
registered as VM_MTE in the vm_flags, and (2) the vma supports MTE,
decided during the mmap() call (only) and registered as VM_MTE_ALLOWED.
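
The two conditions can be sketched as a standalone predicate. The MODEL_*
names and bit positions are placeholders chosen for this illustration and
are not the kernel's definitions of VM_MTE, VM_MTE_ALLOWED or the memory
type indices.

```c
#include <stdint.h>

/* Placeholder flag bits for illustration only. */
#define MODEL_VM_MTE		(1ULL << 32)	/* user asked for PROT_MTE */
#define MODEL_VM_MTE_ALLOWED	(1ULL << 33)	/* vma supports MTE */

enum model_memtype {
	MODEL_MT_NORMAL = 0,		/* index 0, so or'ing it in is a no-op */
	MODEL_MT_NORMAL_TAGGED,
};

/* Tagged memory is selected only when both conditions hold. */
static enum model_memtype model_page_memtype(uint64_t vm_flags)
{
	if ((vm_flags & MODEL_VM_MTE) && (vm_flags & MODEL_VM_MTE_ALLOWED))
		return MODEL_MT_NORMAL_TAGGED;
	return MODEL_MT_NORMAL;
}
```

A request without support, or support without a request, falls back to
the normal memory type.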

arch_calc_vm_prot_bits() is responsible for registering the user request
as VM_MTE. The newly introduced arch_calc_vm_flag_bits() sets
VM_MTE_ALLOWED if the mapping is MAP_ANONYMOUS. An MTE-capable
filesystem (RAM-based) may be able to set VM_MTE_ALLOWED during its
mmap() file ops call.

In addition, update VM_DATA_DEFAULT_FLAGS to allow mprotect(PROT_MTE) on
stack or brk area.

The Linux mmap() syscall currently ignores unknown PROT_* flags. In the
presence of MTE, an mmap(PROT_MTE) on a file which does not support MTE
will not report an error and the memory will not be mapped as Normal
Tagged. For consistency, mprotect(PROT_MTE) will not report an error
either if the memory range does not support MTE. Two subsequent patches
in the series will propose tightening of this behaviour.

Co-developed-by: Vincenzo Frascino 
Signed-off-by: Vincenzo Frascino 
Signed-off-by: Catalin Marinas 
Cc: Will Deacon