linux.git/mm/migrate_device.c, branch v6.5

mm: enable page walking API to lock vmas during the walk

2023-08-21T20:07:20+00:00

walk_page_range() and friends often operate under write-locked mmap_lock. 
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas.  Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.

The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks.  With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them.  The change ensures vmas are properly locked during such
walks.

A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration.  Without this change, a page can be
faulted into the area concurrently and be left out of the migration.
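
As a rough userspace model of the idea (not the kernel code itself), the
sketch below attaches a locking requirement to a table of walk callbacks
and has the walker write-lock each vma before visiting it.  All names
here (walk_ops_model, vma_model, WALK_RDLOCK, WALK_WRLOCK,
walk_range_model()) are hypothetical stand-ins; the message above only
states that a member is added to mm_walk_ops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical locking requirements a walk can declare. */
enum walk_lock_model {
	WALK_RDLOCK,	/* walk does not need to exclude page faults */
	WALK_WRLOCK,	/* walker write-locks every vma it visits */
};

struct vma_model {
	bool write_locked;	/* stands in for the per-vma lock state */
	int visited;
};

struct walk_ops_model {
	int (*visit)(struct vma_model *vma);
	enum walk_lock_model walk_lock;	/* the added "locking requirement" */
};

/* A fault handled under the vma lock proceeds only if the vma is not write-locked. */
static bool fault_can_proceed(const struct vma_model *vma)
{
	return !vma->write_locked;
}

/* Visit each vma, locking it first when the ops table asks for that. */
static int walk_range_model(struct vma_model *vmas, size_t n,
			    const struct walk_ops_model *ops)
{
	for (size_t i = 0; i < n; i++) {
		int err;

		if (ops->walk_lock == WALK_WRLOCK)
			vmas[i].write_locked = true;
		err = ops->visit(&vmas[i]);
		if (err)
			return err;
	}
	return 0;
}

/* Example callback: just record that the vma was visited. */
static int count_visit(struct vma_model *vma)
{
	vma->visited++;
	return 0;
}
```

In this model a walk declared with WALK_WRLOCK leaves every visited vma
in a state where fault_can_proceed() returns false, which is the property
the do_mbind() case relies on.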

Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan 
Suggested-by: Linus Torvalds 
Suggested-by: Jann Horn 
Cc: David Hildenbrand 
Cc: Davidlohr Bueso 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Laurent Dufour 
Cc: Liam Howlett 
Cc: Matthew Wilcox (Oracle) 
Cc: Michal Hocko 
Cc: Michel Lespinasse 
Cc: Peter Xu 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

mm: remove references to pagevec

2023-06-23T23:59:30+00:00

Most of these should just refer to the LRU cache rather than the data
structure used to implement the LRU cache.

Link: https://lkml.kernel.org/r/20230621164557.3510324-13-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Signed-off-by: Andrew Morton

mm: ptep_get() conversion

2023-06-19T23:19:25+00:00

Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper.  This means that by default, the accesses change from a
C dereference to a READ_ONCE().  This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte.  Arch code
is deliberately not converted, as the arch code knows best.  It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
//          COCCI=ptepget.cocci \
//          SPFLAGS="--include-headers" \
//          MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so.  This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
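
The hand-edit described above can be illustrated with a small userspace
model.  This is only a sketch: pte_model_t, the PTE_MODEL_* bits and
ptep_get_model() are hypothetical, and a volatile load stands in for the
kernel's READ_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page table entry and flag bits, for illustration only. */
typedef struct { uint64_t val; } pte_model_t;

#define PTE_MODEL_PRESENT 0x1u
#define PTE_MODEL_DIRTY   0x2u

/* Stand-in for ptep_get(): one volatile load, so the entry is read exactly once. */
static inline pte_model_t ptep_get_model(const pte_model_t *ptep)
{
	pte_model_t pte;

	pte.val = *(const volatile uint64_t *)&ptep->val;
	return pte;
}

/*
 * Read the entry once into a local variable and test that snapshot twice,
 * instead of calling the helper once per test.
 */
static bool pte_present_and_dirty(const pte_model_t *ptep)
{
	pte_model_t pte = ptep_get_model(ptep);

	return (pte.val & PTE_MODEL_PRESENT) && (pte.val & PTE_MODEL_DIRTY);
}
```

Besides avoiding repeated loads, testing a single snapshot means both
checks see the same value even if the entry changes in between.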

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot.  The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get().  HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined.  Fix by continuing to do a direct dereference
when MMU=n.  This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot 
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts 
Cc: Adrian Hunter 
Cc: Alexander Potapenko 
Cc: Alexander Shishkin 
Cc: Alex Williamson 
Cc: Al Viro 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Christian Brauner 
Cc: Christoph Hellwig 
Cc: Daniel Vetter 
Cc: Dave Airlie 
Cc: Dimitri Sivanich 
Cc: Dmitry Vyukov 
Cc: Ian Rogers 
Cc: Jason Gunthorpe 
Cc: Jérôme Glisse 
Cc: Jiri Olsa 
Cc: Johannes Weiner 
Cc: Kirill A. Shutemov 
Cc: Lorenzo Stoakes 
Cc: Mark Rutland 
Cc: Matthew Wilcox 
Cc: Miaohe Lin 
Cc: Michal Hocko 
Cc: Mike Kravetz 
Cc: Mike Rapoport (IBM) 
Cc: Muchun Song 
Cc: Namhyung Kim 
Cc: Naoya Horiguchi 
Cc: Oleksandr Tyshchenko 
Cc: Pavel Tatashin 
Cc: Roman Gushchin 
Cc: SeongJae Park 
Cc: Shakeel Butt 
Cc: Uladzislau Rezki (Sony) 
Cc: Vincenzo Frascino 
Cc: Yu Zhao 
Signed-off-by: Andrew Morton

mm/migrate_device: allow pte_offset_map_lock() to fail

2023-06-19T23:19:17+00:00

migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
splitting huge zero pmd, and the pmd_none() handling after successfully
splitting huge page: those are now managed inside pte_offset_map_lock(),
and by "goto again" when it fails.

But the skip after unsuccessful split_huge_page() must stay: it avoids an
endless loop.  The skip when pmd_bad()?  Remove that: it will be treated
as a hole rather than a skip once cleared by pte_offset_map_lock(), but
with different timing that would be so anyway; and it's arguably best to
leave the pmd_bad() handling centralized there.

migrate_vma_insert_page(): remove comment on the old pte_offset_map() and
old locking limitations; remove the pmd_trans_unstable() check and just
proceed to pte_offset_map_lock(), aborting when it fails (page has been
charged to memcg, but as in other cases, it's uncharged when freed).
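
The "goto again" handling mentioned for migrate_vma_collect_pmd() can be
sketched in userspace as follows.  This is a model under stated
assumptions, not the kernel function: pmd_model, map_lock_model() and
collect_model() are hypothetical names, and a countdown simulates the
mapping helper failing transiently.

```c
#include <assert.h>
#include <stddef.h>

enum pmd_state_model { PMD_NONE_M, PMD_TABLE_M };

struct pmd_model {
	enum pmd_state_model state;
	int transient_failures;	/* times the map helper fails before succeeding */
	int pte_table[4];
};

/* Hypothetical stand-in for pte_offset_map_lock(): may return NULL. */
static int *map_lock_model(struct pmd_model *pmd)
{
	if (pmd->state != PMD_TABLE_M)
		return NULL;
	if (pmd->transient_failures > 0) {
		pmd->transient_failures--;
		return NULL;
	}
	return pmd->pte_table;
}

/*
 * Returns the number of entries collected, or -1 when the range is a hole.
 * On a failed map the pmd is re-examined from the top rather than being
 * special-cased at the call site.
 */
static int collect_model(struct pmd_model *pmd, int *attempts)
{
	int *ptep;

again:
	(*attempts)++;
	if (pmd->state == PMD_NONE_M)
		return -1;
	ptep = map_lock_model(pmd);
	if (!ptep)
		goto again;
	return 4;
}
```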

Link: https://lkml.kernel.org/r/1131be62-2e84-da2f-8f45-807b2cbeeec5@google.com
Signed-off-by: Hugh Dickins 
Reviewed-by: Alistair Popple 
Cc: Anshuman Khandual 
Cc: Axel Rasmussen 
Cc: Christophe Leroy 
Cc: Christoph Hellwig 
Cc: David Hildenbrand 
Cc: "Huang, Ying" 
Cc: Ira Weiny 
Cc: Jason Gunthorpe 
Cc: Kirill A. Shutemov 
Cc: Lorenzo Stoakes 
Cc: Matthew Wilcox 
Cc: Mel Gorman 
Cc: Miaohe Lin 
Cc: Mike Kravetz 
Cc: Mike Rapoport (IBM) 
Cc: Minchan Kim 
Cc: Naoya Horiguchi 
Cc: Pavel Tatashin 
Cc: Peter Xu 
Cc: Peter Zijlstra 
Cc: Qi Zheng 
Cc: Ralph Campbell 
Cc: Ryan Roberts 
Cc: SeongJae Park 
Cc: Song Liu 
Cc: Steven Price 
Cc: Suren Baghdasaryan 
Cc: Thomas Hellström 
Cc: Will Deacon 
Cc: Yang Shi 
Cc: Yu Zhao 
Cc: Zack Rusin 
Signed-off-by: Andrew Morton

mm: change to return bool for isolate_lru_page()

2023-02-20T20:46:17+00:00

isolate_lru_page() can only return 0 or -EBUSY, and most callers do not
care about the specific error value; the one exception is
add_page_for_migration().  Convert isolate_lru_page() to return a
boolean, which makes the checks of its return value clearer.

Also convert all users' logic of checking the isolation state.

No functional changes intended.
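
A minimal userspace sketch of the two conventions, for illustration
only: page_model and the *_model() functions are hypothetical and merely
mimic the return-value change described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct page_model { bool on_lru; };

/* Old convention: 0 on success, -EBUSY on failure. */
static int isolate_old_model(struct page_model *page)
{
	if (!page->on_lru)
		return -EBUSY;
	page->on_lru = false;
	return 0;
}

/* New convention: true when the page was isolated, false otherwise. */
static bool isolate_new_model(struct page_model *page)
{
	if (!page->on_lru)
		return false;
	page->on_lru = false;
	return true;
}

/* A caller that still needs an errno maps the boolean back itself. */
static int add_page_model(struct page_model *page)
{
	return isolate_new_model(page) ? 0 : -EBUSY;
}
```

Note that the sense of the test flips at call sites: a check written as
"if (isolate_old_model(page))" for failure becomes
"if (!isolate_new_model(page))".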

Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang 
Acked-by: David Hildenbrand 
Reviewed-by: Matthew Wilcox (Oracle) 
Acked-by: Linus Torvalds 
Reviewed-by: SeongJae Park 
Signed-off-by: Andrew Morton

mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export

2023-02-03T06:32:54+00:00

mmu_notifier_range_update_to_read_only() was originally introduced in
commit c6d23413f81b ("mm/mmu_notifier:
mmu_notifier_range_update_to_read_only() helper") as an optimisation for
device drivers that know a range has only been mapped read-only.  However,
there are no users of this feature, so remove it.  As it is the only user
of the struct mmu_notifier_range.vma field, remove that as well.

Link: https://lkml.kernel.org/r/20230110025722.600912-1-apopple@nvidia.com
Signed-off-by: Alistair Popple 
Acked-by: Mike Rapoport (IBM) 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Mike Kravetz 
Cc: Ira Weiny 
Cc: Jerome Glisse 
Cc: John Hubbard 
Cc: Ralph Campbell 
Signed-off-by: Andrew Morton

mm/migrate_device: return number of migrating pages in args->cpages

2022-11-23T02:50:43+00:00

migrate_vma->cpages originally contained a count of the number of pages
migrating including non-present pages which can be populated directly on
the target.

Commit 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and
migrate_device_coherent_page()") inadvertently changed this to contain
just the number of pages that were unmapped.  Usage of migrate_vma->cpages
isn't documented, but most drivers use it to see if all the requested
addresses can be migrated, so restore the original behaviour.
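
The difference between the two counts can be modelled in userspace as
below.  The flag names and helpers are hypothetical and exist only for
this sketch; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SRC_MIGRATE_M (1UL << 0)	/* hypothetical: entry is selected for migration */
#define SRC_PRESENT_M (1UL << 1)	/* hypothetical: entry is backed by a page */

/* Restored behaviour: count every migrating entry, present or not. */
static size_t count_migrating_model(const unsigned long *src, size_t n)
{
	size_t cpages = 0;

	for (size_t i = 0; i < n; i++) {
		if (src[i] & SRC_MIGRATE_M)
			cpages++;
	}
	return cpages;
}

/* Regressed behaviour: only entries that had a page to unmap are counted. */
static size_t count_unmapped_only_model(const unsigned long *src, size_t n)
{
	size_t cpages = 0;

	for (size_t i = 0; i < n; i++) {
		if ((src[i] & SRC_MIGRATE_M) && (src[i] & SRC_PRESENT_M))
			cpages++;
	}
	return cpages;
}

/* The driver-side check the message refers to: did everything requested get selected? */
static bool all_requested_migrating_model(const unsigned long *src, size_t n)
{
	return count_migrating_model(src, n) == n;
}
```

With one non-present but migratable entry in a three-entry range, the
first count is 3 and the second is 2, so a driver comparing the count
against the number of requested pages would wrongly conclude that the
migration is partial under the regressed behaviour.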

Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
Fixes: 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()")
Signed-off-by: Alistair Popple 
Reported-by: Ralph Campbell 
Reviewed-by: Ralph Campbell 
Cc: John Hubbard 
Cc: Alex Sierra 
Cc: Ben Skeggs 
Cc: Felix Kuehling 
Cc: Lyude Paul 
Cc: Jason Gunthorpe 
Cc: Michael Ellerman 
Signed-off-by: Andrew Morton

mm/migrate_device.c: add migrate_device_range()

2022-10-13T01:51:49+00:00

Device drivers can use the migrate_vma family of functions to migrate
existing private anonymous mappings to device private pages.  These pages
are backed by memory on the device with drivers being responsible for
copying data to and from device memory.

Device private pages are freed via the pgmap->page_free() callback when
they are unmapped and their refcount drops to zero.  Alternatively they
may be freed indirectly via migration back to CPU memory in response to a
pgmap->migrate_to_ram() callback called whenever the CPU accesses an
address mapped to a device private page.

In other words, drivers cannot control the lifetime of data allocated on
the devices and must wait until these pages are freed from userspace.
This causes issues when memory needs to be reclaimed on the device, either
because the device is going away due to a ->release() callback or because
another user needs to use the memory.

Drivers could use the existing migrate_vma functions to migrate data off
the device.  However this would require them to track the mappings of each
page which is both complicated and not always possible.  Instead drivers
need to be able to migrate device pages directly so they can free up
device memory.

To allow that, this patch introduces the migrate_device family of functions,
which are functionally similar to migrate_vma but skip the initial
lookup based on mapping.
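
As a loose userspace illustration of selecting pages by pfn range rather
than by walking a mapping, consider the sketch below.  Everything in it
is an assumption made for the example: dev_page_model, the SRC_*_M
encoding and device_range_model() are hypothetical and do not reproduce
the kernel's data structures or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SRC_VALID_M   (1UL << 0)	/* hypothetical: entry holds a pfn */
#define SRC_MIGRATE_M (1UL << 1)	/* hypothetical: entry selected for migration */
#define SRC_SHIFT_M   2

struct dev_page_model {
	int refcount;	/* 0 means the page is already free */
	bool locked;	/* true means someone else is working on it */
};

/*
 * Fill src[] for the pfns [start, start + npages) directly from the page
 * array, with no address-space lookup.  Free or busy pages get a zero
 * entry; the rest are pinned, locked and marked for migration.
 * Returns the number of pages selected.
 */
static size_t device_range_model(unsigned long *src, struct dev_page_model *pages,
				 unsigned long start, size_t npages)
{
	size_t selected = 0;

	for (size_t i = 0; i < npages; i++) {
		struct dev_page_model *page = &pages[start + i];

		src[i] = 0;
		if (page->refcount == 0 || page->locked)
			continue;
		page->refcount++;
		page->locked = true;
		src[i] = ((start + i) << SRC_SHIFT_M) | SRC_VALID_M | SRC_MIGRATE_M;
		selected++;
	}
	return selected;
}
```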

Link: https://lkml.kernel.org/r/868116aab70b0c8ee467d62498bb2cf0ef907295.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple 
Cc: "Huang, Ying" 
Cc: Zi Yan 
Cc: Matthew Wilcox 
Cc: Yang Shi 
Cc: David Hildenbrand 
Cc: Ralph Campbell 
Cc: John Hubbard 
Cc: Alex Deucher 
Cc: Alex Sierra 
Cc: Ben Skeggs 
Cc: Christian König 
Cc: Dan Williams 
Cc: Felix Kuehling 
Cc: Jason Gunthorpe 
Cc: Lyude Paul 
Cc: Michael Ellerman 
Signed-off-by: Andrew Morton

mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()

2022-10-13T01:51:49+00:00

migrate_device_coherent_page() reuses the existing migrate_vma family of
functions to migrate a specific page without providing a valid mapping or
vma.  This looks a bit odd because it means we are calling migrate_vma_*()
without setting a valid vma, however it was considered acceptable at the
time because the details were internal to migrate_device.c and there was
only a single user.

One of the reasons the details could be kept internal was that this was
strictly for migrating device coherent memory.  Such memory can be copied
directly by the CPU without intervention from a driver.  However this
isn't true for device private memory, and a future change requires similar
functionality for device private memory.  So refactor the code into
something more sensible for migrating device memory without a vma.

Link: https://lkml.kernel.org/r/c7b2ff84e9b33d022cf4a40f87d051f281a16d8f.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple 
Cc: "Huang, Ying" 
Cc: Zi Yan 
Cc: Matthew Wilcox 
Cc: Yang Shi 
Cc: David Hildenbrand 
Cc: Ralph Campbell 
Cc: John Hubbard 
Cc: Alex Deucher 
Cc: Alex Sierra 
Cc: Ben Skeggs 
Cc: Christian König 
Cc: Dan Williams 
Cc: Felix Kuehling 
Cc: Jason Gunthorpe 
Cc: Lyude Paul 
Cc: Michael Ellerman 
Signed-off-by: Andrew Morton

mm/memory.c: fix race when faulting a device private page

2022-10-13T01:51:49+00:00

Patch series "Fix several device private page reference counting issues",
v2

This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages.  These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems.  However
without these fixes it is possible to crash the kernel from userspace. 
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting. 
In modules such as Nouveau it is also possible to trigger some of these
issues by explicitly closing the device file-descriptor prior to the task
exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code.
Unfortunately I lack hardware to test either of those so any help there
would be appreciated.  The changes mimic what is done for both Nouveau
and hmm-tests though, so I doubt they will cause problems.


This patch (of 8):

When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called.  However no
reference is taken on the faulting page.  Therefore a concurrent migration
of the device private page can free the page and possibly the underlying
pgmap.  This results in a race which can crash the kernel due to the
migrate_to_ram() function pointer becoming invalid.  It also means drivers
can't reliably read the zone_device_data field because the page may have
been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to
ensure it has not been freed.  Unfortunately the elevated reference count
will cause the migration required to handle the fault to fail.  To avoid
this failure pass the faulting page into the migrate_vma functions so that
if an elevated reference count is found it can be checked to see if it's
expected or not.
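
The reference count reasoning above can be sketched as a small userspace
model.  It assumes, for illustration, that migration itself holds one
reference and that the fault handler holds one more on the faulting page;
page_model and check_page_model() are hypothetical names, not the kernel
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model {
	int refcount;	/* total references on the page */
	int mapcount;	/* references accounted for by mappings */
};

/*
 * Decide whether a page's reference count is "expected".  One extra
 * reference is assumed to belong to the migration code, plus one more
 * when the page is the one the fault handler pinned.
 */
static bool check_page_model(const struct page_model *page,
			     const struct page_model *fault_page)
{
	int extra = 1 + (page == fault_page);

	return (page->refcount - extra) <= page->mapcount;
}
```

For a page with refcount 3 and mapcount 1, the check passes when that
page is passed in as the faulting page and fails when it is not, which
is why the faulting page has to be handed down to the migration code.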

[mpe@ellerman.id.au: fix build]
  Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Ralph Campbell 
Cc: Michael Ellerman 
Cc: Lyude Paul 
Cc: Alex Deucher 
Cc: Alex Sierra 
Cc: Ben Skeggs 
Cc: Christian König 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: "Huang, Ying" 
Cc: Matthew Wilcox 
Cc: Yang Shi 
Cc: Zi Yan 
Signed-off-by: Andrew Morton