linux.git/mm/userfaultfd.c, branch v5.14

userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()

2021-07-01T03:47:27+00:00

In a previous commit, we added the mfill_atomic_install_pte() helper.
This helper does the job of setting up PTEs for an existing page, to map
it into a given VMA.  It deals with both the anon and shmem cases, as well
as the shared and private cases.

In other words, shmem_mfill_atomic_pte() duplicates a case the helper
already handles.  So, expose the helper, and let shmem_mfill_atomic_pte()
use it directly, to reduce code duplication.

This requires that we refactor shmem_mfill_atomic_pte() a bit:

Instead of doing accounting (shmem_recalc_inode() et al) part-way through
the PTE setup, do it afterward.  This frees up mfill_atomic_install_pte()
from having to care about this accounting, and means we don't need to e.g.
shmem_uncharge() in the error path.

A side effect is this switches shmem_mfill_atomic_pte() to use
lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add().
This wrapper does some extra accounting in an exceptional case, if
appropriate, so it's actually the more correct thing to use.

Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Reviewed-by: Peter Xu 
Acked-by: Hugh Dickins 
Cc: Alexander Viro 
Cc: Andrea Arcangeli 
Cc: Brian Geffon 
Cc: "Dr . David Alan Gilbert" 
Cc: Jerome Glisse 
Cc: Joe Perches 
Cc: Kirill A. Shutemov 
Cc: Lokesh Gidra 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Mina Almasry 
Cc: Oliver Upton 
Cc: Shaohua Li 
Cc: Shuah Khan 
Cc: Stephen Rothwell 
Cc: Wang Qing 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

2021-07-01T03:47:27+00:00

With this change, userspace can resolve a minor fault within a
shmem-backed area with a UFFDIO_CONTINUE ioctl.  The semantics for this
match those for hugetlbfs - we look up the existing page in the page
cache, and install a PTE for it.

This commit introduces a new helper: mfill_atomic_install_pte.

Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
shmem.c?  The existing userfault implementation only relies on shmem.c for
VM_SHARED VMAs.  However, minor fault handling / CONTINUE work just fine
for !VM_SHARED VMAs as well.  We'd prefer to handle CONTINUE for shmem in
one place, regardless of shared/private (to reduce code duplication).

Why add a new mfill_atomic_install_pte helper?  A problem we have with
CONTINUE is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are
*close* to what we want, but not exactly.  We do want to set up the PTEs in
a CONTINUE operation, but we don't want to e.g. allocate a new page,
charge it (e.g. to the shmem inode), manipulate various flags, etc.  Also
we have the problem stated above: shmem_mfill_atomic_pte() and
mcopy_atomic_pte() each handle only one half of the problem (shared /
private) that CONTINUE cares about.  So, introduce mcontinue_atomic_pte()
to handle all of the shmem CONTINUE cases.  Introduce the helper so that
mcontinue_atomic_pte() doesn't duplicate code with mcopy_atomic_pte().

In a future commit, shmem_mfill_atomic_pte() will also be modified to use
this new helper.  However, since this is a bigger refactor, it seems most
clear to do it as a separate change.

Link: https://lkml.kernel.org/r/20210503180737.2487560-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Acked-by: Hugh Dickins 
Acked-by: Peter Xu 
Cc: Alexander Viro 
Cc: Andrea Arcangeli 
Cc: Brian Geffon 
Cc: "Dr . David Alan Gilbert" 
Cc: Jerome Glisse 
Cc: Joe Perches 
Cc: Kirill A. Shutemov 
Cc: Lokesh Gidra 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Mina Almasry 
Cc: Oliver Upton 
Cc: Shaohua Li 
Cc: Shuah Khan 
Cc: Stephen Rothwell 
Cc: Wang Qing 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte

2021-07-01T03:47:27+00:00

Patch series "userfaultfd: add minor fault handling for shmem", v6.

Overview
========

See the series which added minor faults for hugetlbfs [3] for a detailed
overview of minor fault handling in general.  This series adds the same
support for shmem-backed areas.

This series is structured as follows:

- Commits 1 and 2 are cleanups.
- Commits 3 and 4 implement the new feature (minor fault handling for shmem).
- Commit 5 advertises that the feature is now available since at this point it's
  fully implemented.
- Commit 6 is a final cleanup, modifying an existing code path to re-use a new
  helper we've introduced.
- Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature.

Use Case
========

In some cases it is useful to have VM memory backed by tmpfs instead of
hugetlbfs.  So, this feature will be used to support the same VM live
migration use case described in my original series.

Additionally, Android folks (Lokesh Gidra) hope
to optimize the Android Runtime garbage collector using this feature:

"The plan is to use userfaultfd for concurrently compacting the heap.
With this feature, the heap can be shared-mapped at another location where
the GC-thread(s) could continue the compaction operation without the need
to invoke userfault ioctl(UFFDIO_COPY) each time.  OTOH, if and when Java
threads get faults on the heap, UFFDIO_CONTINUE can be used to resume
execution.  Furthermore, this feature enables updating references in the
'non-moving' portion of the heap efficiently.  Without this feature,
unnecessary page copying (ioctl(UFFDIO_COPY)) would be required."

[1] https://lore.kernel.org/patchwork/cover/1388144/
[2] https://lore.kernel.org/patchwork/patch/1408161/
[3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t

This patch (of 9):

Previously, we did a dance where we had one calling path in userfaultfd.c
(mfill_atomic_pte), but then we split it into two in shmem_fs.h
(shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single
shared function in shmem.c (shmem_mfill_atomic_pte).

This is all a bit overly complex.  Just call the single combined shmem
function directly, allowing us to clean up various branches, boilerplate,
etc.

While we're touching this function, two other small cleanup changes:
- offset is equivalent to pgoff, so we can get rid of offset entirely.
- Split two VM_BUG_ON cases into two statements. This means the line
  number reported when the BUG is hit specifies exactly which condition
  was true.
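
To illustrate the second point, here is a small userspace sketch.  It
assumes a hypothetical RECORD_BUG_ON() macro that stands in for
VM_BUG_ON() and records the line number instead of crashing; the function
names are made up for the example and are not kernel code.

```c
#include <assert.h>

/* Hypothetical stand-in for VM_BUG_ON(): remember the line, don't crash. */
static int last_bug_line;
#define RECORD_BUG_ON(cond) \
	do { if (cond) last_bug_line = __LINE__; } while (0)

/* One combined check: both conditions report the same line. */
int check_combined(int a_bad, int b_bad)
{
	last_bug_line = 0;
	RECORD_BUG_ON(a_bad || b_bad);
	return last_bug_line;
}

/* Two separate checks: the reported line identifies the condition. */
int check_split(int a_bad, int b_bad)
{
	last_bug_line = 0;
	RECORD_BUG_ON(a_bad);
	if (last_bug_line)
		return last_bug_line;	/* a real BUG would stop here */
	RECORD_BUG_ON(b_bad);
	return last_bug_line;
}
```

With the combined form, a report for either condition carries the same
line number; with the split form, the two conditions report different
lines.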

Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Reviewed-by: Peter Xu 
Acked-by: Hugh Dickins 
Cc: Alexander Viro 
Cc: Andrea Arcangeli 
Cc: Brian Geffon 
Cc: "Dr . David Alan Gilbert" 
Cc: Jerome Glisse 
Cc: Joe Perches 
Cc: Kirill A. Shutemov 
Cc: Lokesh Gidra 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Mina Almasry 
Cc: Oliver Upton 
Cc: Shaohua Li 
Cc: Shuah Khan 
Cc: Stephen Rothwell 
Cc: Wang Qing 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY

2021-07-01T03:47:26+00:00

On UFFDIO_COPY, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we will drop the mutex and return to the caller after
allocating a page that consumed a reservation.  In this case there may be
a fault that double consumes the reservation.  To handle this, we free the
allocated page, fix the reservations, and allocate a temporary hugetlb
page and return that to the caller.  When the caller does the copy outside
of the lock, we again check the cache, and allocate a page consuming the
reservation, and copy over the contents.

Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning and the copy_huge_page_from_user() always fails, then:

./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 \
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10 \
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings.  After the tests run, the
number of free/resv hugepages is correct.

[yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
  Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
[almasrymina@google.com: fix allocation error check and copy func name]
  Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com

Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
Signed-off-by: Mina Almasry 
Signed-off-by: YueHaibing 
Cc: Axel Rasmussen 
Cc: Peter Xu 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd: hugetlbfs: fix new flag usage in error path

2021-05-23T01:09:07+00:00

In commit d6995da31122 ("hugetlb: use page.private for hugetlb specific
page flags") the use of PagePrivate to indicate a reservation count
should be restored at free time was changed to the hugetlb specific flag
HPageRestoreReserve.  Changes to a userfaultfd error path as well as a
VM_BUG_ON() in remove_inode_hugepages() were overlooked.

Users could see incorrect hugetlb reserve counts if they experience an
error with a UFFDIO_COPY operation.  Specifically, this would be the
result of an unlikely copy_huge_page_from_user error.  There is not an
increased chance of hitting the VM_BUG_ON.

Link: https://lkml.kernel.org/r/20210521233952.236434-1-mike.kravetz@oracle.com
Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Mike Kravetz 
Reviewed-by: Mina Almasry 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Miaohe Lin 
Cc: Mina Almasry 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

userfaultfd: add UFFDIO_CONTINUE ioctl

2021-05-05T18:27:22+00:00

This ioctl is how userspace ought to resolve "minor" userfaults.  The
idea is, userspace is notified that a minor fault has occurred.  It
might change the contents of the page using its second non-UFFD mapping,
or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
ensured the page contents are correct, carry on setting up the mapping".
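
As a rough userspace sketch of that last step: the wrapper below issues
UFFDIO_CONTINUE for one range.  It assumes `uffd` is an already-opened
userfaultfd whose range was registered for minor faults, and that the
installed UAPI headers are recent enough to define UFFDIO_CONTINUE.  The
name uffd_continue() is hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Ask the kernel to install PTEs for pages that already exist in the
 * page cache.  Returns 0 on success or a negative errno value.
 */
int uffd_continue(int uffd, unsigned long long start, unsigned long long len)
{
	struct uffdio_continue cont = {
		.range = { .start = start, .len = len },
		.mode = 0,	/* 0: wake the faulting thread(s) when done */
	};

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1)
		return -errno;
	return 0;
}
```

A fault-handling thread would typically call this after it has made sure,
through its second mapping, that the page contents are correct.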

Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
the minor fault case, we already have some pre-existing underlying page.
Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
We'd just use memcpy() or similar instead.

It turns out hugetlb_mcopy_atomic_pte() already does very close to what
we want, if an existing page is provided via `struct page **pagep`.  We
already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
just extend that design: add an enum for the three modes of operation,
and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
case.  (Basically, look up the existing page, and avoid adding the
existing page to the page cache or calling set_page_huge_active() on
it.)

Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Reviewed-by: Peter Xu 
Cc: Adam Ruprecht 
Cc: Alexander Viro 
Cc: Alexey Dobriyan 
Cc: Andrea Arcangeli 
Cc: Anshuman Khandual 
Cc: Cannon Matthews 
Cc: Catalin Marinas 
Cc: Chinwen Chang 
Cc: David Rientjes 
Cc: "Dr . David Alan Gilbert" 
Cc: Huang Ying 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Jerome Glisse 
Cc: Kirill A. Shutemov 
Cc: Lokesh Gidra 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Michael Ellerman 
Cc: "Michal Koutný" 
Cc: Michel Lespinasse 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Mina Almasry 
Cc: Nicholas Piggin 
Cc: Oliver Upton 
Cc: Shaohua Li 
Cc: Shawn Anastasio 
Cc: Steven Price 
Cc: Steven Rostedt 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share()

2021-05-05T18:27:20+00:00

Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4.

This series tries to disable huge pmd unshare of hugetlbfs backed memory
for uffd-wp.  Although uffd-wp of hugetlbfs is still in the RFC stage,
the idea of this series may be needed for multiple tasks (Axel's uffd
minor fault series, and Mike's soft dirty series), so I picked it out
from the larger series.

This patch (of 4):

This is preparatory work so that the per-architecture huge_pte_alloc()
can behave differently depending on VMA attributes.

Pass the vma deeper into huge_pmd_share() as well, so that we can avoid
the find_vma() call.

[peterx@redhat.com: build fix]
  Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1

Link: https://lkml.kernel.org/r/20210218230633.15028-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210218230633.15028-2-peterx@redhat.com
Signed-off-by: Peter Xu 
Suggested-by: Mike Kravetz 
Cc: Adam Ruprecht 
Cc: Alexander Viro 
Cc: Alexey Dobriyan 
Cc: Andrea Arcangeli 
Cc: Anshuman Khandual 
Cc: Axel Rasmussen 
Cc: Cannon Matthews 
Cc: Catalin Marinas 
Cc: Chinwen Chang 
Cc: David Rientjes 
Cc: "Dr . David Alan Gilbert" 
Cc: Huang Ying 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Jerome Glisse 
Cc: Kirill A. Shutemov 
Cc: Lokesh Gidra 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Michael Ellerman 
Cc: "Michal Koutný" 
Cc: Michel Lespinasse 
Cc: Mike Rapoport 
Cc: Mina Almasry 
Cc: Nicholas Piggin 
Cc: Oliver Upton 
Cc: Shaohua Li 
Cc: Shawn Anastasio 
Cc: Steven Price 
Cc: Steven Rostedt 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/vmscan: protect the workingset on anonymous LRU

2020-08-12T17:57:55+00:00

In the current implementation, a newly created or swapped-in anonymous
page starts on the active list.  Growing the active list triggers
rebalancing of the active/inactive lists, so old pages on the active list
are demoted to the inactive list.  Hence, a page on the active list isn't
protected at all.

The following is an example of this situation.

Assume that there are 50 hot pages on the active list.  Numbers denote the
number of pages on the active/inactive lists (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.  As with the file LRU, newly created
or swapped-in anonymous pages will be inserted into the inactive list.
They are promoted to the active list if they are referenced enough.  This
simple modification changes the above example as follows.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, the hot pages on the active list are now protected.
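
The two scenarios can be replayed with a toy model.  This is only a sketch
under simplifying assumptions: memory holds 100 pages, the active list is
capped at 50, both lists are plain FIFOs, and references/promotion are not
modelled.  All names here are made up for the example and are not kernel
code.

```c
#include <assert.h>
#include <stddef.h>

#define CAPACITY   100	/* total resident pages in the example */
#define ACTIVE_MAX  50	/* assumed size limit for the active list */

enum insert_policy { INSERT_ACTIVE, INSERT_INACTIVE };

/* FIFO of page tags (1 = hot, 0 = used-once); index 0 is the oldest. */
struct lru {
	int pages[CAPACITY + 1];
	size_t len;
};

static void lru_push(struct lru *l, int hot)
{
	l->pages[l->len++] = hot;
}

static int lru_pop_oldest(struct lru *l)
{
	int hot = l->pages[0];

	for (size_t i = 1; i < l->len; i++)
		l->pages[i - 1] = l->pages[i];
	l->len--;
	return hot;
}

static int lru_count_hot(const struct lru *l)
{
	int n = 0;

	for (size_t i = 0; i < l->len; i++)
		n += l->pages[i];
	return n;
}

/* Returns how many of the 50 hot pages are still resident at the end. */
int hot_pages_resident(enum insert_policy policy)
{
	struct lru active = { .len = 0 }, inactive = { .len = 0 };

	for (int i = 0; i < 50; i++)
		lru_push(&active, 1);

	/* Two batches of 50 newly created, used-once pages. */
	for (int i = 0; i < 100; i++) {
		if (policy == INSERT_ACTIVE) {
			lru_push(&active, 0);
			/* Rebalance: demote the oldest active page. */
			if (active.len > ACTIVE_MAX)
				lru_push(&inactive, lru_pop_oldest(&active));
		} else {
			lru_push(&inactive, 0);
		}
		/* Over capacity: swap out the oldest inactive page. */
		if (active.len + inactive.len > CAPACITY)
			(void)lru_pop_oldest(&inactive);
	}
	return lru_count_hot(&active) + lru_count_hot(&inactive);
}
```

In this model, inserting new pages on the active list ends with no hot
pages resident, while inserting them on the inactive list keeps all 50.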

Note that this implementation has a drawback: a page cannot be promoted
and will be swapped out if its re-access interval is greater than the size
of the inactive list but less than the total size (active + inactive).
To solve this potential issue, a following patch will apply workingset
detection similar to the one that's already applied to the file LRU.

Signed-off-by: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Cc: Hugh Dickins 
Cc: Matthew Wilcox 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Minchan Kim 
Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds

mmap locking API: convert mmap_sem comments

2020-06-09T16:39:14+00:00

Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse 
Signed-off-by: Andrew Morton 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Daniel Jordan 
Cc: Davidlohr Bueso 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Jason Gunthorpe 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Laurent Dufour 
Cc: Liam Howlett 
Cc: Matthew Wilcox 
Cc: Peter Zijlstra 
Cc: Ying Han 
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites

2020-06-09T16:39:14+00:00

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse 
Signed-off-by: Andrew Morton 
Reviewed-by: Daniel Jordan 
Reviewed-by: Laurent Dufour 
Reviewed-by: Vlastimil Babka 
Cc: Davidlohr Bueso 
Cc: David Rientjes 
Cc: Hugh Dickins 
Cc: Jason Gunthorpe 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Liam Howlett 
Cc: Matthew Wilcox 
Cc: Peter Zijlstra 
Cc: Ying Han 
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds