linux.git/fs/proc/task_mmu.c, branch v5.13

userfaultfd: add minor fault registration mode

2021-05-05T18:27:22+00:00

Patch series "userfaultfd: add minor fault handling", v9.

Overview
========

This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults.  By "minor" fault, I mean the following
situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory).  One of the mappings is registered with userfaultfd (in minor
mode), and the other is not.  Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents.  The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault.  As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, etc.).  In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".
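
As a rough userspace sketch of the calls involved (not taken from this
series): the helper names below are hypothetical, the code assumes uapi
headers from a kernel that includes the whole series (v5.13 or later), and
it follows the patch description further down in using a separate
registration mode, UFFDIO_REGISTER_MODE_MINOR.  The uffd argument is assumed
to be a descriptor obtained from the userfaultfd(2) system call.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: request the minor-fault feature in the API handshake. */
int uffd_enable_minor(int uffd)
{
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_HUGETLBFS,
	};

	return ioctl(uffd, UFFDIO_API, &api) == -1 ? -errno : 0;
}

/* Hypothetical helper: register [start, start + len) for minor faults. */
int uffd_register_minor(int uffd, unsigned long start, unsigned long len)
{
	struct uffdio_register reg = {
		.range = { .start = start, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};

	return ioctl(uffd, UFFDIO_REGISTER, &reg) == -1 ? -errno : 0;
}

/*
 * Hypothetical helper: tell the kernel the page contents are correct and
 * that it can go ahead and install the mapping for this range.
 */
int uffd_continue(int uffd, unsigned long start, unsigned long len)
{
	struct uffdio_continue cont = {
		.range = { .start = start, .len = len },
		.mode = 0,
	};

	return ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1 ? -errno : 0;
}
```

Each helper returns 0 on success or a negative errno value, so a caller can
tell the error cases listed under "Interaction with Existing APIs" apart.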

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.
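
The decision made in step 4 can be sketched as follows.  This is only an
illustration: the per-page "stale" bitmap, the helper names, and the way the
fresh copy arrives are all assumptions, not part of this series.  After the
function returns, the manager would issue UFFDIO_CONTINUE whether or not a
copy was made.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bitmap: bit i set means page i on the target is out of date. */
static bool page_is_stale(const uint8_t *stale, size_t idx)
{
	return stale[idx / 8] & (1u << (idx % 8));
}

/*
 * Bring one page up to date by writing through the non-UFFD mapping.
 * Returns true if a copy was performed, false if the page was already
 * current and nothing had to be done.
 */
bool refresh_page(uint8_t *stale, size_t idx, void *non_uffd_page,
		  const void *fresh_copy, size_t page_size)
{
	if (!page_is_stale(stale, idx))
		return false;

	memcpy(non_uffd_page, fresh_copy, page_size);
	stale[idx / 8] &= (uint8_t)~(1u << (idx % 8));
	return true;
}
```

The same function could serve the background copier mentioned above, since
clearing the bit makes a later fault on that page a no-op.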

Interaction with Existing APIs
==============================

Because this is a feature, a registered VMA could potentially receive both
missing and minor faults.  I spent some time thinking through how the
existing API interacts with the new feature:

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated.  This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).

Future Work
===========

This series only supports hugetlbfs.  I have a second series in flight to
support shmem as well, extending the functionality.  This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.

This patch (of 6):

This feature allows userspace to intercept "minor" faults.  By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not.  Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents.  The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault.  As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.

This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered.  In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.

This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].

However, doing it this way requires we allocate a VM_* flag for the new
registration mode.  On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.

[1] https://lore.kernel.org/patchwork/patch/1380226/

[peterx@redhat.com: fix minor fault page leak]
  Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com

Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Reviewed-by: Peter Xu 
Reviewed-by: Mike Kravetz 
Cc: Alexander Viro 
Cc: Alexey Dobriyan 
Cc: Andrea Arcangeli 
Cc: Anshuman Khandual 
Cc: Catalin Marinas 
Cc: Chinwen Chang 
Cc: Huang Ying 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Jerome Glisse 
Cc: Lokesh Gidra 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Michael Ellerman 
Cc: "Michal Koutný" 
Cc: Michel Lespinasse 
Cc: Mike Rapoport 
Cc: Nicholas Piggin 
Cc: Peter Xu 
Cc: Shaohua Li 
Cc: Shawn Anastasio 
Cc: Steven Rostedt 
Cc: Steven Price 
Cc: Vlastimil Babka 
Cc: Adam Ruprecht 
Cc: Axel Rasmussen 
Cc: Cannon Matthews 
Cc: "Dr . David Alan Gilbert" 
Cc: David Rientjes 
Cc: Mina Almasry 
Cc: Oliver Upton 
Cc: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: use is_cow_mapping() across tree where proper

2021-03-13T19:27:30+00:00

Now that is_cow_mapping() is exported in mm.h, replace the manual checks
elsewhere throughout the tree with the new helper.
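
For reference, the check the helper performs can be reproduced in userspace
as below.  The two flag values are copied from include/linux/mm.h so that
the sketch builds outside the kernel; the unsigned long parameter type is a
stand-in for the kernel's own flags type.

```c
#include <stdbool.h>

/* Values as defined in include/linux/mm.h. */
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* A mapping is copy-on-write if it may be written but is not shared. */
bool is_cow_mapping(unsigned long flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```

This is the open-coded test that the converted call sites carried before.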

Link: https://lkml.kernel.org/r/20210217233547.93892-5-peterx@redhat.com
Signed-off-by: Peter Xu 
Reviewed-by: Jason Gunthorpe 
Cc: VMware Graphics 
Cc: Roland Scheidegger 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Mike Kravetz 
Cc: Alexey Dobriyan 
Cc: Andrea Arcangeli 
Cc: Christoph Hellwig 
Cc: David Gibson 
Cc: Gal Pressman 
Cc: Jan Kara 
Cc: Jann Horn 
Cc: Kirill Shutemov 
Cc: Kirill Tkhai 
Cc: Matthew Wilcox 
Cc: Miaohe Lin 
Cc: Mike Rapoport 
Cc: Wei Zhang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: proc: Invalidate TLB after clearing soft-dirty page state

2021-01-29T19:02:28+00:00

Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double
flush"), TLB invalidation is elided in tlb_finish_mmu() if no entries
were batched via the tlb_remove_*() functions. Consequently, the
page-table modifications performed by clear_refs_write() in response to
a write to /proc/<pid>/clear_refs do not perform TLB invalidation.
Although this is fine when simply aging the ptes, in the case of
clearing the "soft-dirty" state we can end up with entries where
pte_write() is false, yet a writable mapping remains in the TLB.

Fix this by avoiding the mmu_gather API altogether: managing both the
'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB
invalidation for the soft-dirty path, much like mprotect() does already.

Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush")
Signed-off-by: Will Deacon 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Yu Zhao 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Linus Torvalds 
Link: https://lkml.kernel.org/r/20210127235347.1402-2-will@kernel.org

mm: don't play games with pinned pages in clear_page_refs

2021-01-16T18:51:26+00:00

Turning a pinned page read-only breaks the pinning after COW.  Don't do it.

The whole "track page soft dirty" state doesn't work with pinned pages
anyway, since the page might be dirtied by the pinning entity without
ever being noticed in the page tables.

Signed-off-by: Linus Torvalds

mm: fix clear_refs_write locking

2021-01-16T18:46:39+00:00

Turning page table entries read-only requires the mmap_sem held for
writing.

So stop doing the odd games with turning things from read locks to write
locks and back.  Just get the write lock.

Signed-off-by: Linus Torvalds

proc: use untagged_addr() for pagemap_read addresses

2020-12-11T22:02:14+00:00

When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag.
To fix it, we should untag the userspace pointers in pagemap_read().

I tested with 5.10-rc4 and the issue remains.

Explanation from Catalin in [1]:

 "Arguably, that's a user-space bug since tagged file offsets were never
  supported. In this case it's not even a tag at bit 56 as per the arm64
  tagged address ABI but rather down to bit 47. You could say that the
  problem is caused by the C library (malloc()) or whoever created the
  tagged vaddr and passed it to this function. It's not a kernel
  regression as we've never supported it.

  Now, pagemap is a special case where the offset is usually not
  generated as a classic file offset but rather derived by shifting a
  user virtual address. I guess we can make a concession for pagemap
  (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"

My test code is based on [2]:

A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8

userspace program:

  uint64 OsLayer::VirtualToPhysical(void *vaddr) {
	uint64 frame, paddr, pfnmask, pagemask;
	int pagesize = sysconf(_SC_PAGESIZE);
	off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
	int fd = open(kPagemapPath, O_RDONLY);
	...

	if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
		int err = errno;
		string errtxt = ErrorString(err);
		if (fd >= 0)
			close(fd);
		return 0;
	}
  ...
  }

kernel fs/proc/task_mmu.c:

  static ssize_t pagemap_read(struct file *file, char __user *buf,
		size_t count, loff_t *ppos)
  {
	...
	src = *ppos;
	svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
	start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
	end_vaddr = mm->task_size;

	/* watch out for wraparound */
	// svpfn == 0xb400007662f54
	// (mm->task_size >> PAGE_SHIFT) == 0x8000000
	if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
		start_vaddr = end_vaddr;

	ret = 0;
	while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
		int len;
		unsigned long end;
		...
	}
	...
  }
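
The arithmetic in the two listings above can be checked with a small
userspace sketch.  It assumes 4 KiB pages, and untag_user_addr() is a
hypothetical stand-in for the kernel's untagged_addr(): it simply clears the
top byte, which is what untagging amounts to for an arm64 user address.

```c
#include <stdint.h>

#define PM_ENTRY_BYTES	8u	/* one 64-bit pagemap entry per page */
#define PAGE_SHIFT_4K	12u	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for untagged_addr() on a user address. */
uint64_t untag_user_addr(uint64_t addr)
{
	return addr & ((UINT64_C(1) << 56) - 1);
}

/* File offset userspace seeks to for a given virtual address. */
uint64_t pagemap_offset(uint64_t vaddr)
{
	return (vaddr >> PAGE_SHIFT_4K) * PM_ENTRY_BYTES;
}

/* Start address recovered from the file offset, optionally untagged. */
uint64_t pagemap_start_vaddr(uint64_t file_off, int untag)
{
	uint64_t svpfn = file_off / PM_ENTRY_BYTES;
	uint64_t start = svpfn << PAGE_SHIFT_4K;

	return untag ? untag_user_addr(start) : start;
}
```

With the tagged pointer 0xb400007662f541c8, the recovered address is
0xb400007662f54000 without untagging and 0x7662f54000 with it, which is
below the task size of 0x8000000000 implied by the comment above.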

[1] https://lore.kernel.org/patchwork/patch/1343258/
[2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158

Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen 
Reviewed-by: Vincenzo Frascino 
Reviewed-by: Catalin Marinas 
Cc: Alexey Dobriyan 
Cc: Andrey Konovalov 
Cc: Alexander Potapenko 
Cc: Vincenzo Frascino 
Cc: Andrey Ryabinin 
Cc: Catalin Marinas 
Cc: Dmitry Vyukov 
Cc: Marco Elver 
Cc: Will Deacon 
Cc: Eric W. Biederman 
Cc: Song Bao Hua (Barry Song) 
Cc: <stable@vger.kernel.org>	[5.4-]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove the now-unnecessary mmget_still_valid() hack

2020-10-16T18:11:22+00:00

The preceding patches have ensured that core dumping properly takes the
mmap_lock.  Thanks to that, we can now remove mmget_still_valid() and all
its users.

Signed-off-by: Jann Horn 
Signed-off-by: Andrew Morton 
Acked-by: Linus Torvalds 
Cc: Christoph Hellwig 
Cc: Alexander Viro 
Cc: "Eric W . Biederman" 
Cc: Oleg Nesterov 
Cc: Hugh Dickins 
Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
Signed-off-by: Linus Torvalds

mm: proc: smaps_rollup: do not stall write attempts on mmap_lock

2020-10-14T01:38:31+00:00

smaps_rollup grabs mmap_lock and walks the whole vma list until it
finishes iterating.  For large processes, mmap_lock is held for a long
time, which may block write requests such as mmap and munmap from
progressing smoothly.

There are upcoming mmap_lock optimizations like range-based locks, but the
lock applied to smaps_rollup would be the coarse type, which doesn't avoid
the occurrence of unpleasant contention.

To solve the aforementioned issue, we add a check which detects whether
anyone is waiting to grab mmap_lock for writing.

Signed-off-by: Chinwen Chang 
Signed-off-by: Andrew Morton 
Cc: Steven Price 
Cc: Michel Lespinasse 
Cc: Matthias Brugger 
Cc: Vlastimil Babka 
Cc: Daniel Jordan 
Cc: Davidlohr Bueso 
Cc: Chinwen Chang 
Cc: Alexey Dobriyan 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Jason Gunthorpe 
Cc: Song Liu 
Cc: Jimmy Assarsson 
Cc: Huang Ying 
Cc: Daniel Kiss 
Cc: Laurent Dufour 
Link: http://lkml.kernel.org/r/1597715898-3854-4-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Linus Torvalds

mm: smaps*: extend smap_gather_stats to support specified beginning

2020-10-14T01:38:31+00:00

Extend smap_gather_stats to support a specified starting address at which
it should begin gathering.  To that end, we add a new parameter @start,
assigned by the caller, and refactor the function for simplicity.

If @start is 0, it will use the range of @vma for gathering.
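
The meaning of @start can be illustrated with a simplified model.  The
struct and function below are hypothetical and only compute how many bytes
of the vma a gather pass would cover; they assume that a non-zero @start
lies inside the vma.

```c
/* Hypothetical, simplified stand-in for a vma's address range. */
struct vma_range {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Bytes covered by one gather pass: the whole vma when start is 0,
 * otherwise only the part from start up to the end of the vma.
 */
unsigned long gather_span(const struct vma_range *vma, unsigned long start)
{
	unsigned long from = start ? start : vma->vm_start;

	return vma->vm_end - from;
}
```

A caller that has to resume part-way through a vma would pass the address
where it stopped, and 0 everywhere else.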

Signed-off-by: Chinwen Chang 
Signed-off-by: Andrew Morton 
Reviewed-by: Steven Price 
Cc: Michel Lespinasse 
Cc: Alexey Dobriyan 
Cc: Daniel Jordan 
Cc: Daniel Kiss 
Cc: Davidlohr Bueso 
Cc: Huang Ying 
Cc: Jason Gunthorpe 
Cc: Jimmy Assarsson 
Cc: Laurent Dufour 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Matthias Brugger 
Cc: Song Liu 
Cc: Vlastimil Babka 
Link: http://lkml.kernel.org/r/1597715898-3854-3-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Linus Torvalds

proc: optimise smaps for shmem entries

2020-10-14T01:38:29+00:00

Avoid bumping the refcount on pages when we're only interested in the
swap entries.

Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Andrew Morton 
Acked-by: Johannes Weiner 
Cc: Alexey Dobriyan 
Cc: Chris Wilson 
Cc: Huang Ying 
Cc: Hugh Dickins 
Cc: Jani Nikula 
Cc: Matthew Auld 
Cc: William Kucharski 
Link: https://lkml.kernel.org/r/20200910183318.20139-5-willy@infradead.org
Signed-off-by: Linus Torvalds