linux-stable.git/fs/userfaultfd.c, branch v5.13.2

userfaultfd: add UFFDIO_CONTINUE ioctl

2021-05-05T18:27:22+00:00

This ioctl is how userspace ought to resolve "minor" userfaults.  The
idea is, userspace is notified that a minor fault has occurred.  It
might change the contents of the page using its second non-UFFD mapping,
or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`.  We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case.  (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)

Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Reviewed-by: Peter Xu 
Cc: Adam Ruprecht 
Cc: Alexander Viro 
Cc: Alexey Dobriyan 
Cc: Andrea Arcangeli 
Cc: Anshuman Khandual 
Cc: Cannon Matthews 
Cc: Catalin Marinas 
Cc: Chinwen Chang 
Cc: David Rientjes 
Cc: "Dr . David Alan Gilbert" 
Cc: Huang Ying 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Jerome Glisse 
Cc: Kirill A. Shutemov 
Cc: Lokesh Gidra 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Michael Ellerman 
Cc: "Michal Koutn" 
Cc: Michel Lespinasse 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Mina Almasry 
Cc: Nicholas Piggin 
Cc: Oliver Upton 
Cc: Shaohua Li 
Cc: Shawn Anastasio 
Cc: Steven Price 
Cc: Steven Rostedt 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd: add minor fault registration mode

2021-05-05T18:27:22+00:00

Patch series "userfaultfd: add minor fault handling", v9.

Overview
========

This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
When enabled (via the UFFDIO_API ioctl), this feature means that any
hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
get events for "minor" faults.  By "minor" fault, I mean the following
situation:

Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
memory).  One of the mappings is registered with userfaultfd (in minor
mode), and the other is not.  Via the non-UFFD mapping, the underlying
pages have already been allocated & filled with some contents.  The UFFD
mapping has not yet been faulted in; when it is touched for the first
time, this results in what I'm calling a "minor" fault.  As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.

We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
is, userspace resolves the fault by either a) doing nothing if the
contents are already correct, or b) updating the underlying contents using
the second, non-UFFD mapping (via memcpy/memset or similar, or something
fancier like RDMA, or etc...).  In either case, userspace issues
UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
correct, carry on setting up the mapping".

Use Case
========

Consider the use case of VM live migration (e.g. under QEMU/KVM):

1. While a VM is still running, we copy the contents of its memory to a
   target machine. The pages are populated on the target by writing to the
   non-UFFD mapping, using the setup described above. The VM is still running
   (and therefore its memory is likely changing), so this may be repeated
   several times, until we decide the target is "up to date enough".

2. We pause the VM on the source, and start executing on the target machine.
   During this gap, the VM's user(s) will *see* a pause, so it is desirable to
   minimize this window.

3. Between the last time any page was copied from the source to the target, and
   when the VM was paused, the contents of that page may have changed - and
   therefore the copy we have on the target machine is out of date. Although we
   can keep track of which pages are out of date, for VMs with large amounts of
   memory, it is "slow" to transfer this information to the target machine. We
   want to resume execution before such a transfer would complete.

4. So, the guest begins executing on the target machine. The first time it
   touches its memory (via the UFFD-registered mapping), userspace wants to
   intercept this fault. Userspace checks whether or not the page is up to date,
   and if not, copies the updated page from the source machine, via the non-UFFD
   mapping. Finally, whether a copy was performed or not, userspace issues a
   UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
   are correct, carry on setting up the mapping".

We don't have to do all of the final updates on-demand. The userfaultfd manager
can, in the background, also copy over updated pages once it receives the map of
which pages are up-to-date or not.

Interaction with Existing APIs
==============================

Because this is a feature, a registered VMA could potentially receive both
missing and minor faults.  I spent some time thinking through how the
existing API interacts with the new feature:

UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:

- For non-shared memory or shmem, -EINVAL is returned.
- For hugetlb, -EFAULT is returned.

UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
Without modifications, the existing codepath assumes a new page needs to
be allocated.  This is okay, since userspace must have a second
non-UFFD-registered mapping anyway, thus there isn't much reason to want
to use these in any case (just memcpy or memset or similar).

- If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
- If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
  in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
- UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
  -ENOENT in that case (regardless of the kind of fault).

Future Work
===========

This series only supports hugetlbfs.  I have a second series in flight to
support shmem as well, extending the functionality.  This series is more
mature than the shmem support at this point, and the functionality works
fully on hugetlbfs, so this series can be merged first and then shmem
support will follow.

This patch (of 6):

This feature allows userspace to intercept "minor" faults.  By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
mappings is registered with userfaultfd (in minor mode), and the other is
not.  Via the non-UFFD mapping, the underlying pages have already been
allocated & filled with some contents.  The UFFD mapping has not yet been
faulted in; when it is touched for the first time, this results in what
I'm calling a "minor" fault.  As a concrete example, when working with
hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
page.

This commit adds the new registration mode, and sets the relevant flag on
the VMAs being registered.  In the hugetlb fault path, if we find that we
have huge_pte_none(), but find_lock_page() does indeed find an existing
page, then we have a "minor" fault, and if the VMA has the userfaultfd
registration flag, we call into userfaultfd to handle it.

This is implemented as a new registration mode, instead of an API feature.
This is because the alternative implementation has significant drawbacks
[1].

However, doing it this was requires we allocate a VM_* flag for the new
registration mode.  On 32-bit systems, there are no unused bits, so this
feature is only supported on architectures with
CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
MINOR mode on 32-bit architectures, we return -EINVAL.

[1] https://lore.kernel.org/patchwork/patch/1380226/

[peterx@redhat.com: fix minor fault page leak]
  Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com

Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Reviewed-by: Peter Xu 
Reviewed-by: Mike Kravetz 
Cc: Alexander Viro 
Cc: Alexey Dobriyan 
Cc: Andrea Arcangeli 
Cc: Anshuman Khandual 
Cc: Catalin Marinas 
Cc: Chinwen Chang 
Cc: Huang Ying 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Jerome Glisse 
Cc: Lokesh Gidra 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Michael Ellerman 
Cc: "Michal Koutn" 
Cc: Michel Lespinasse 
Cc: Mike Rapoport 
Cc: Nicholas Piggin 
Cc: Peter Xu 
Cc: Shaohua Li 
Cc: Shawn Anastasio 
Cc: Steven Rostedt 
Cc: Steven Price 
Cc: Vlastimil Babka 
Cc: Adam Ruprecht 
Cc: Axel Rasmussen 
Cc: Cannon Matthews 
Cc: "Dr . David Alan Gilbert" 
Cc: David Rientjes 
Cc: Mina Almasry 
Cc: Oliver Upton 
Cc: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp

2021-05-05T18:27:20+00:00

Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because
userfaultfd-wp is always based on pgtable entries, so they cannot be
shared.

Walk the hugetlb range and unshare all such mappings if there is, right
before UFFDIO_REGISTER will succeed and return to userspace.

This will pair with want_pmd_share() in hugetlb code so that huge pmd
sharing is completely disabled for userfaultfd-wp registered range.

Link: https://lkml.kernel.org/r/20210218231206.15524-1-peterx@redhat.com
Signed-off-by: Peter Xu 
Reviewed-by: Mike Kravetz 
Cc: Peter Xu 
Cc: Andrea Arcangeli 
Cc: Axel Rasmussen 
Cc: Mike Rapoport 
Cc: Kirill A. Shutemov 
Cc: Matthew Wilcox (Oracle) 
Cc: Adam Ruprecht 
Cc: Alexander Viro 
Cc: Alexey Dobriyan 
Cc: Anshuman Khandual 
Cc: Cannon Matthews 
Cc: Catalin Marinas 
Cc: Chinwen Chang 
Cc: David Rientjes 
Cc: "Dr . David Alan Gilbert" 
Cc: Huang Ying 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Jerome Glisse 
Cc: Lokesh Gidra 
Cc: Michael Ellerman 
Cc: "Michal Koutn" 
Cc: Michel Lespinasse 
Cc: Mina Almasry 
Cc: Nicholas Piggin 
Cc: Oliver Upton 
Cc: Shaohua Li 
Cc: Shawn Anastasio 
Cc: Steven Price 
Cc: Steven Rostedt 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd: use secure anon inodes for userfaultfd

2021-01-14T22:40:57+00:00

This change gives userfaultfd file descriptors a real security
context, allowing policy to act on them.

Signed-off-by: Daniel Colascione 
[LG: Remove owner inode from userfaultfd_ctx]
[LG: Use anon_inode_getfd_secure() in userfaultfd syscall]
[LG: Use inode of file in userfaultfd_read() in resolve_userfault_fork()]
Signed-off-by: Lokesh Gidra 
Reviewed-by: Eric Biggers 
Signed-off-by: Paul Moore

userfaultfd: add user-mode only option to unprivileged_userfaultfd sysctl knob

2020-12-15T20:13:46+00:00

With this change, when the knob is set to 0, it allows unprivileged users
to call userfaultfd, like when it is set to 1, but with the restriction
that page faults from only user-mode can be handled.  In this mode, an
unprivileged user (without SYS_CAP_PTRACE capability) must pass
UFFD_USER_MODE_ONLY to userfaultd or the API will fail with EPERM.

This enables administrators to reduce the likelihood that an attacker with
access to userfaultfd can delay faulting kernel code to widen timing
windows for other exploits.

The default value of this knob is changed to 0.  This is required for
correct functioning of pipe mutex.  However, this will fail postcopy live
migration, which will be unnoticeable to the VM guests.  To avoid this,
set 'vm.userfault = 1' in /sys/sysctl.conf.

The main reason this change is desirable as in the short term is that the
Android userland will behave as with the sysctl set to zero.  So without
this commit, any Linux binary using userfaultfd to manage its memory would
behave differently if run within the Android userland.  For more details,
refer to Andrea's reply [1].

[1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/

Link: https://lkml.kernel.org/r/20201120030411.2690816-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra 
Reviewed-by: Andrea Arcangeli 
Cc: Kees Cook 
Cc: Jonathan Corbet 
Cc: Peter Xu 
Cc: Sebastian Andrzej Siewior 
Cc: Alexander Viro 
Cc: Stephen Smalley 
Cc: Eric Biggers 
Cc: Daniel Colascione 
Cc: "Joel Fernandes (Google)" 
Cc: Kalesh Singh 
Cc: Suren Baghdasaryan 
Cc: Jeff Vander Stoep 
Cc: 
Cc: Mike Rapoport 
Cc: Shaohua Li 
Cc: Jerome Glisse 
Cc: Mauro Carvalho Chehab 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Nitin Gupta 
Cc: Vlastimil Babka 
Cc: Iurii Zaikin 
Cc: Luis Chamberlain 
Cc: Daniel Colascione 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd: add UFFD_USER_MODE_ONLY

2020-12-15T20:13:46+00:00

Patch series "Control over userfaultfd kernel-fault handling", v6.

This patch series is split from [1].  The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.

It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel.  For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3].  Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.

This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object.  It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.

The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808

This patch (of 2):

userfaultfd handles page faults from both user and kernel code.  Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.

A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.

Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione 
Signed-off-by: Lokesh Gidra 
Reviewed-by: Andrea Arcangeli 
Cc: Alexander Viro 
Cc: 
Cc: Daniel Colascione 
Cc: Eric Biggers 
Cc: Iurii Zaikin 
Cc: Jeff Vander Stoep 
Cc: Jerome Glisse 
Cc: "Joel Fernandes (Google)" 
Cc: Johannes Weiner 
Cc: Jonathan Corbet 
Cc: Kalesh Singh 
Cc: Kees Cook 
Cc: Luis Chamberlain 
Cc: Mauro Carvalho Chehab 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Nitin Gupta 
Cc: Peter Xu 
Cc: Sebastian Andrzej Siewior 
Cc: Shaohua Li 
Cc: Stephen Smalley 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove the now-unnecessary mmget_still_valid() hack

2020-10-16T18:11:22+00:00

The preceding patches have ensured that core dumping properly takes the
mmap_lock.  Thanks to that, we can now remove mmget_still_valid() and all
its users.

Signed-off-by: Jann Horn 
Signed-off-by: Andrew Morton 
Acked-by: Linus Torvalds 
Cc: Christoph Hellwig 
Cc: Alexander Viro 
Cc: "Eric W . Biederman" 
Cc: Oleg Nesterov 
Cc: Hugh Dickins 
Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
Signed-off-by: Linus Torvalds

Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2020-08-11T02:07:44+00:00

Pull locking updates from Thomas Gleixner:
 "A set of locking fixes and updates:

   - Untangle the header spaghetti which causes build failures in
     various situations caused by the lockdep additions to seqcount to
     validate that the write side critical sections are non-preemptible.

   - The seqcount associated lock debug addons which were blocked by the
     above fallout.

     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict
     per CPU seqcounts. As the lock is not part of the seqcount, lockdep
     cannot validate that the lock is held.

     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored
     and write_seqcount_begin() has a lockdep assertion to validate that
     the lock is held.

     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API
     is unchanged and determines the type at compile time with the help
     of _Generic which is possible now that the minimal GCC version has
     been moved up.

     Adding this lockdep coverage unearthed a handful of seqcount bugs
     which have been addressed already independent of this.

     While generally useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if
     the writers are serialized by an associated lock, which leads to
     the well known reader preempts writer livelock. RT prevents this by
     storing the associated lock pointer independent of lockdep in the
     seqcount and changing the reader side to block on the lock when a
     reader detects that a writer is in the write side critical section.

   - Conversion of seqcount usage sites to associated types and
     initializers"

* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  locking/seqlock, headers: Untangle the spaghetti monster
  locking, arch/ia64: Reduce  header dependencies by moving XTP bits into the new  header
  x86/headers: Remove APIC headers from 
  seqcount: More consistent seqprop names
  seqcount: Compress SEQCNT_LOCKNAME_ZERO()
  seqlock: Fold seqcount_LOCKNAME_init() definition
  seqlock: Fold seqcount_LOCKNAME_t definition
  seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
  hrtimer: Use sequence counter with associated raw spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  vfs: Use sequence counter with associated spinlock
  timekeeping: Use sequence counter with associated raw spinlock
  xfrm: policy: Use sequence counters with associated lock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  sched: tasks: Use sequence counter with associated spinlock
  ...

userfaultfd: simplify fault handling

2020-08-03T18:25:16+00:00

Instead of waiting in a loop for the userfaultfd condition to become
true, just wait once and return VM_FAULT_RETRY.

We've already dropped the mmap lock, we know we can't really
successfully handle the fault at this point and the caller will have to
retry anyway.  So there's no point in making the wait any more
complicated than it needs to be - just schedule away.

And once you don't have that complexity with explicit looping, you can
also just lose all the 'userfaultfd_signal_pending()' complexity,
because once we've set the correct process sleeping state, and don't
loop, the act of scheduling itself will be checking if there are any
pending signals before going to sleep.

We can also drop the VM_FAULT_MAJOR games, since we'll be treating all
retried faults as major soon anyway (series to regularize and share more
of fault handling across architectures in a separate series by Peter Xu,
and in the meantime we won't worry about the possible minor - I'll be
here all week, try the veal - accounting difference).

Cc: Andrea Arcangeli 
Cc: Peter Xu 
Signed-off-by: Linus Torvalds

userfaultfd: Use sequence counter with associated spinlock

2020-07-29T14:14:28+00:00

A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200720155530.1173732-23-a.darwish@linutronix.de