linux-stable.git/drivers/gpu/drm/xe, branch v6.14

drm/xe: Fix exporting xe buffers multiple times

2025-03-20T16:59:49+00:00

The `struct ttm_resource->placement` contains TTM_PL_FLAG_* flags, but
it was incorrectly tested for XE_PL_* flags.
This caused xe_dma_buf_pin() to always fail when invoked for
the second time. Fix this by checking the `mem_type` field instead.

Fixes: 7764222d54b7 ("drm/xe: Disallow pinning dma-bufs in VRAM")
Cc: Thomas Hellström 
Cc: Rodrigo Vivi 
Cc: Lucas De Marchi 
Cc: "Thomas Hellström" 
Cc: Michal Wajdeczko 
Cc: Matthew Brost 
Cc: Matthew Auld 
Cc: Nirmoy Das 
Cc: Jani Nikula 
Cc: intel-xe@lists.freedesktop.org
Cc:  # v6.8+
Signed-off-by: Tomasz Rusinowicz 
Signed-off-by: Jacek Lawrynowicz 
Reviewed-by: Matthew Brost 
Link: https://patchwork.freedesktop.org/patch/msgid/20250218100353.2137964-1-jacek.lawrynowicz@linux.intel.com
Signed-off-by: Thomas Hellström 
(cherry picked from commit b96dabdba9b95f71ded50a1c094ee244408b2a8e)
Signed-off-by: Thomas Hellström

drm/xe: remove redundant check in xe_vm_create_ioctl()

2025-03-10T18:01:43+00:00

The check for args->extensions is repeated twice in xe_vm_create_ioctl().
This commit removes the redundant check to streamline the code.

Fixes: 7224788f6756 ("drm/xe: Kill XE_VM_PROPERTY_BIND_OP_ERROR_CAPTURE_ADDRESS extension")
Cc: Rodrigo Vivi 
Signed-off-by: Xin Wang 
Reviewed-by: Tejas Upadhyay 
Reviewed-by: Matthew Auld 
Link: https://patchwork.freedesktop.org/patch/msgid/20250303004942.951699-1-x.wang@intel.com
Signed-off-by: Rodrigo Vivi 
(cherry picked from commit 8da8aecf1f2d89c2b8188bcf7aa252ec146ddd12)
Signed-off-by: Rodrigo Vivi

drm/xe/guc_pc: Retry and wait longer for GuC PC start

2025-03-10T15:53:37+00:00

In a rare situation of thermal limit during resume, GuC can
be slow and run into delays like this:

xe 0000:00:02.0: [drm] GT1: excessive init time: 667ms! \
   		 [status = 0x8002F034, timeouts = 0]
xe 0000:00:02.0: [drm] GT1: excessive init time: \
   		 [freq = 100MHz (req = 800MHz), before = 100MHz, \
   		 perf_limit_reasons = 0x1C001000]
xe 0000:00:02.0: [drm] *ERROR* GT1: GuC PC Start failed
------------[ cut here ]------------
xe 0000:00:02.0: [drm] GT1: Failed to start GuC PC: -EIO

When this happens, it will block entirely the GPU to be used.
So, let's try and with a huge timeout in the hope it comes back.

Also, let's collect some information on how long it is usually
taking on situations like this, so perhaps the time can be tuned
later.

Cc: Vinay Belgaumkar 
Cc: Jonathan Cavitt 
Cc: John Harrison 
Reviewed-by: Jonathan Cavitt 
Link: https://patchwork.freedesktop.org/patch/msgid/20250307160307.1093391-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi 
(cherry picked from commit b4b05e53b550a886b4754b87fd0dd2b304579e85)
Signed-off-by: Rodrigo Vivi

drm/xe/pm: Temporarily disable D3Cold on BMG

2025-03-10T15:42:32+00:00

Currently, many instability cases related to D3Cold -> D0 transition
on BMG are under investigation. Among them some bad cases where
the device is lost after 1 to 3 transitions from D3Cold to D0
on the runtime pm, with pcieport upstream bridge port link retrain
failure.

In other cases, it works fine, but with some sudden random memory
corruptions after D3cold, that could be 0xffff missed ack on GT
forcewake or GuC reload related failures.

In some other cases though, D3Cold -> D0 works pretty reliably.
It looks like it is a combination of GPU cards and Host boards at
this point. So, there is no possible/available quirk at this time.

This patch disables the D3Cold by default on BMG by reducing the
vram_d3cold_threshold to 0. Users and developers who wants to enable
it are still able to via
$ echo 300 > /sys/bus/pci/devices//vram_d3cold_threshold

Fixes: 3adcf970dc7e ("drm/xe/bmg: Drop force_probe requirement")
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4037
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4395
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4396
Cc: Karthik Poosa 
Reviewed-by: Lucas De Marchi 
Link: https://patchwork.freedesktop.org/patch/msgid/20250308005636.1475420-1-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi 
(cherry picked from commit d945cc876277851053c0cf37927c8d7bd9d0e880)
Signed-off-by: Rodrigo Vivi

drm/xe/userptr: Fix an incorrect assert

2025-03-10T15:42:25+00:00

The assert incorrectly checks the total length processed which
can in fact be greater than the number of pages. Fix.

Fixes: 0a98219bcc96 ("drm/xe/hmm: Don't dereference struct page pointers without notifier lock")
Cc: Matthew Auld 
Cc: Matthew Brost 
Signed-off-by: Thomas Hellström 
Reviewed-by: Matthew Auld 
Link: https://patchwork.freedesktop.org/patch/msgid/20250307100109.21397-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit 70e5043ba85eae199b232e39921abd706b5c1fa4)
Signed-off-by: Rodrigo Vivi

drm/xe: Release guc ids before cancelling work

2025-03-10T15:42:21+00:00

A GT resets can be occurring in parallel while cancelling
work in async call  which can requeue these workers.
to avoid that, lets first release guc ids and then cancel
work so they don't requeued.

Fixes: 8ae8a2e8dd21 ("drm/xe: Long running job update")
Fixes: 12c2f962fe71 ("drm/xe: cancel pending job timer before freeing scheduler")
Signed-off-by: Tejas Upadhyay 
Suggested-by: Matthew Brost 
Reviewed-by: Matthew Brost 
Link: https://patchwork.freedesktop.org/patch/msgid/20250306131211.975503-1-tejas.upadhyay@intel.com
Signed-off-by: Lucas De Marchi 
(cherry picked from commit 8e8d76f62329127b31c64a034b052fb9e30e92af)
Signed-off-by: Rodrigo Vivi

drm/xe/userptr: Unmap userptrs in the mmu notifier

2025-03-05T19:25:27+00:00

If userptr pages are freed after a call to the xe mmu notifier,
the device will not be blocked out from theoretically accessing
these pages unless they are also unmapped from the iommu, and
this violates some aspects of the iommu-imposed security.

Ensure that userptrs are unmapped in the mmu notifier to
mitigate this. A naive attempt would try to free the sg table, but
the sg table itself may be accessed by a concurrent bind
operation, so settle for only unmapping.

v3:
- Update lockdep asserts.
- Fix a typo (Matthew Auld)

Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng 
Cc: Matthew Auld 
Cc:  # v6.10+
Signed-off-by: Thomas Hellström 
Reviewed-by: Matthew Auld 
Acked-by: Matthew Brost 
Link: https://patchwork.freedesktop.org/patch/msgid/20250304173342.22009-4-thomas.hellstrom@linux.intel.com
(cherry picked from commit ba767b9d01a2c552d76cf6f46b125d50ec4147a6)
Signed-off-by: Rodrigo Vivi

drm/xe/hmm: Don't dereference struct page pointers without notifier lock

2025-03-05T19:25:23+00:00

The pnfs that we obtain from hmm_range_fault() point to pages that
we don't have a reference on, and the guarantee that they are still
in the cpu page-tables is that the notifier lock must be held and the
notifier seqno is still valid.

So while building the sg table and marking the pages accesses / dirty
we need to hold this lock with a validated seqno.

However, the lock is reclaim tainted which makes
sg_alloc_table_from_pages_segment() unusable, since it internally
allocates memory.

Instead build the sg-table manually. For the non-iommu case
this might lead to fewer coalesces, but if that's a problem it can
be fixed up later in the resource cursor code. For the iommu case,
the whole sg-table may still be coalesced to a single contigous
device va region.

This avoids marking pages that we don't own dirty and accessed, and
it also avoid dereferencing struct pages that we don't own.

v2:
- Use assert to check whether hmm pfns are valid (Matthew Auld)
- Take into account that large pages may cross range boundaries
  (Matthew Auld)

v3:
- Don't unnecessarily check for a non-freed sg-table. (Matthew Auld)
- Add a missing up_read() in an error path. (Matthew Auld)

Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng 
Cc:  # v6.10+
Signed-off-by: Thomas Hellström 
Reviewed-by: Matthew Auld 
Acked-by: Matthew Brost 
Link: https://patchwork.freedesktop.org/patch/msgid/20250304173342.22009-3-thomas.hellstrom@linux.intel.com
(cherry picked from commit ea3e66d280ce2576664a862693d1da8fd324c317)
Signed-off-by: Rodrigo Vivi

drm/xe/hmm: Style- and include fixes

2025-03-05T19:25:18+00:00

Add proper #ifndef around the xe_hmm.h header, proper spacing
and since the documentation mostly follows kerneldoc format,
make it kerneldoc. Also prepare for upcoming -stable fixes.

Fixes: 81e058a3e7fd ("drm/xe: Introduce helper to populate userptr")
Cc: Oak Zeng 
Cc:  # v6.10+
Signed-off-by: Thomas Hellström 
Reviewed-by: Matthew Auld 
Acked-by: Matthew Brost 
Link: https://patchwork.freedesktop.org/patch/msgid/20250304173342.22009-2-thomas.hellstrom@linux.intel.com
(cherry picked from commit bbe2b06b55bc061c8fcec034ed26e88287f39143)
Signed-off-by: Rodrigo Vivi

drm/xe: Add staging tree for VM binds

2025-03-05T19:25:11+00:00

Concurrent VM bind staging and zapping of PTEs from a userptr notifier
do not work because the view of PTEs is not stable. VM binds cannot
acquire the notifier lock during staging, as memory allocations are
required. To resolve this race condition, use a staging tree for VM
binds that is committed only under the userptr notifier lock during the
final step of the bind. This ensures a consistent view of the PTEs in
the userptr notifier.

A follow up may only use staging for VM in fault mode as this is the
only mode in which the above race exists.

v3:
 - Drop zap PTE change (Thomas)
 - s/xe_pt_entry/xe_pt_entry_staging (Thomas)

Suggested-by: Thomas Hellström 
Cc: 
Fixes: e8babb280b5e ("drm/xe: Convert multiple bind ops into single job")
Fixes: a708f6501c69 ("drm/xe: Update PT layer with better error handling")
Signed-off-by: Matthew Brost 
Reviewed-by: Thomas Hellström 
Link: https://patchwork.freedesktop.org/patch/msgid/20250228073058.59510-5-thomas.hellstrom@linux.intel.com
Signed-off-by: Thomas Hellström 
(cherry picked from commit 6f39b0c5ef0385eae586760d10b9767168037aa5)
Signed-off-by: Rodrigo Vivi