linux-stable.git/drivers/gpu/drm/xe, branch v7.0.10

drm/xe/dma-buf: fix UAF with retry loop

2026-05-23T11:09:42+00:00

commit 155a372a1cc50fa93387c5d3cdfd614a61e1afd1 upstream.

Retry doesn't work here, since the bo will be freed on error, leading to
a UAF. However, now that we do the alloc & init before the attach, we can
combine them as one unit and have the init do the alloc for us. This
should make the retry safe.
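
A minimal userspace sketch of why folding the alloc into the init makes the retry safe. It is not the driver code: demo_bo, demo_bo_create(), demo_import() and the use of -EAGAIN as the retry condition are all hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer object. */
struct demo_bo {
	int initialized;
};

static int demo_attempts;

/*
 * Allocate and initialise as one unit. On failure the object is freed
 * here and NULL is returned, so the caller never holds a stale pointer.
 * The first fail_first_n calls fail with -EAGAIN to simulate a retry.
 */
static struct demo_bo *demo_bo_create(int fail_first_n, int *err)
{
	struct demo_bo *bo;

	assert(err != NULL);

	bo = calloc(1, sizeof(*bo));
	if (!bo) {
		*err = -ENOMEM;
		return NULL;
	}

	if (demo_attempts++ < fail_first_n) {
		free(bo);		/* init failure frees the object */
		*err = -EAGAIN;
		return NULL;
	}

	bo->initialized = 1;
	*err = 0;
	return bo;
}

/* Each retry iteration starts from a fresh allocation. */
static struct demo_bo *demo_import(int fail_first_n, int *err)
{
	struct demo_bo *bo;

	demo_attempts = 0;
	do {
		bo = demo_bo_create(fail_first_n, err);
	} while (!bo && *err == -EAGAIN);

	return bo;
}
```

Had the allocation stayed outside the loop, the second iteration would have re-initialised a pointer that the first failure had already freed.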

Reported by Sashiko.

v2: Fix up the error unwind (CI)

Closes: https://sashiko.dev/#/patchset/20260506184332.86743-2-matthew.auld%40intel.com
Fixes: eb289a5f6cc6 ("drm/xe: Convert xe_dma_buf.c for exhaustive eviction")
Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Matthew Brost 
Cc:  # v6.18+
Reviewed-by: Thomas Hellström 
Link: https://patch.msgid.link/20260508102635.149172-4-matthew.auld@intel.com
(cherry picked from commit 479669418253e0f27f8cf5db01a731352ea592e7)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Greg Kroah-Hartman

drm/xe/dma-buf: handle empty bo and UAF races

2026-05-23T11:09:42+00:00

commit 981bedbbe61364fcc3a3b87ebaf648a66cd07108 upstream.

There look to be some nasty races here when triggering the
invalidate_mappings hook:

1) We do xe_bo_alloc() followed by the attach, before the actual full bo
   init step in xe_dma_buf_init_obj(). However, the bo is visible on the
   attachments list after the attach. This is bad since the exporter
   driver, say amdgpu, can at any time call back into our
   invalidate_mappings hook with an empty/bogus bo, leading to potential
   bugs/crashes.

2) Similar to 1) but here we get a UAF when the invalidate_mappings
   hook is triggered. For example, we get as far as xe_bo_init_locked()
   but this fails in some way. The bo will be freed on error, but we
   still have it attached from the dma-buf pov, so if invalidate_mappings
   is now triggered then the bo we access is gone and we trigger a UAF
   and more bugs/crashes.

To fix this, move the attach step to after we actually have a fully
set up buffer object. Note that the bo is not published to userspace
until later, so not sure what the comment "Don't publish the bo
until we have a valid attachment" is referring to.
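
The ordering can be sketched in plain C as follows. This is an illustration under assumptions, not the xe code: every demo_* name is hypothetical, and the exporter calling back "at any time" is simulated by invoking the hook directly from the attach step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the importer's bo and the attachment. */
struct demo_bo {
	int ready;
};

struct demo_attachment {
	struct demo_bo *importer_priv;
};

/* Invalidate hook: 0 if it saw a fully set up bo, -1 otherwise. */
static int demo_invalidate(const struct demo_attachment *attach)
{
	const struct demo_bo *bo = attach->importer_priv;

	return (bo && bo->ready) ? 0 : -1;
}

static void demo_bo_init(struct demo_bo *bo)
{
	bo->ready = 1;
}

/*
 * Attaching makes the bo visible to the exporter. The immediate call
 * to the hook simulates a callback racing right behind the attach.
 */
static int demo_attach(struct demo_attachment *attach, struct demo_bo *bo)
{
	assert(attach != NULL && bo != NULL);
	attach->importer_priv = bo;
	return demo_invalidate(attach);
}

/* Fixed order: finish the init, then attach. */
static int demo_import_fixed(struct demo_bo *bo, struct demo_attachment *attach)
{
	demo_bo_init(bo);
	return demo_attach(attach, bo);
}

/* Old order: the hook can observe a bo that is not set up yet. */
static int demo_import_racy(struct demo_bo *bo, struct demo_attachment *attach)
{
	int seen = demo_attach(attach, bo);

	demo_bo_init(bo);
	return seen;
}
```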

We have at least two different customers reporting hitting a NULL ptr
deref in evict_flags when importing something from amdgpu, followed by
triggering the evict flow. Hit rate is also pretty low, which would
hint at some kind of race, so something like 1) or 2) might explain
this.

v2:
  - Shuffle the order of the ops slightly (no functional change)
  - Improve the comment to better explain the ordering (Matt B)

Assisted-by: Gemini:gemini-3 #debug
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7903
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/4055
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld 
Cc: Thomas Hellström 
Cc: Matthew Brost 
Cc:  # v6.8+
Reviewed-by: Matthew Brost 
Acked-by: Thomas Hellström 
Link: https://patch.msgid.link/20260508102635.149172-3-matthew.auld@intel.com
(cherry picked from commit af1f2ad0c59fe4e2f924c526f66e968289d77971)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Greg Kroah-Hartman

drm/xe/xelp: Fix Wa_18022495364

2026-05-23T11:09:34+00:00

[ Upstream commit 7fe6cae2f7fad2b5166b0fc096618629f9e2ebcb ]

It looks like I mistyped CS_DEBUG_MODE2 as CS_DEBUG_MODE1 when adding the
workaround. Fix it.

Signed-off-by: Tvrtko Ursulin 
Fixes: ca33cd271ef9 ("drm/xe/xelp: Add Wa_18022495364")
Cc: Matt Roper 
Cc: "Thomas Hellström" 
Cc: Rodrigo Vivi 
Cc:  # v6.18+
Reviewed-by: Matt Roper 
Signed-off-by: Thomas Hellström 
Link: https://patch.msgid.link/20260116095040.49335-1-tvrtko.ursulin@igalia.com
Stable-dep-of: 0df99689eb79 ("drm/xe/xelp: Fix Wa_18022495364")
Signed-off-by: Sasha Levin

drm/xe/gsc: Fix BO leak on error in query_compatibility_version()

2026-05-23T11:09:34+00:00

[ Upstream commit 3762d6c36549accea7068c4a175483fafdd03657 ]

When xe_gsc_read_out_header() fails, query_compatibility_version()
returns directly instead of jumping to the out_bo label. This skips
the xe_bo_unpin_map_no_vm() call, leaving the BO pinned and mapped
with no remaining reference to free it.

Fix by using goto out_bo so the error path properly cleans up the BO,
consistent with the other error handling in the same function.
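
The shape of the fix, as a hedged userspace sketch. demo_query(), the demo_buf_* helpers and the live-buffer counter are hypothetical and only model the pin/unpin pairing described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Counts buffers that are currently allocated, to make leaks visible. */
static int demo_live_buffers;

static void *demo_buf_create(void)
{
	void *buf = malloc(64);

	if (buf)
		demo_live_buffers++;
	return buf;
}

static void demo_buf_destroy(void *buf)
{
	assert(buf != NULL);
	free(buf);
	demo_live_buffers--;
}

/* read_fails simulates the header read returning an error. */
static int demo_query(int read_fails)
{
	void *buf = demo_buf_create();
	int err = 0;

	if (!buf)
		return -ENOMEM;

	if (read_fails) {
		err = -EIO;
		goto out_buf;	/* a bare "return err" here would leak buf */
	}

	/* ... the success path would consume the buffer here ... */

out_buf:
	demo_buf_destroy(buf);
	return err;
}
```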

Fixes: 0881cbe04077 ("drm/xe/gsc: Query GSC compatibility version")
Cc: Daniele Ceraolo Spurio 
Reviewed-by: Daniele Ceraolo Spurio 
Link: https://patch.msgid.link/20260417163308.3416147-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin 
(cherry picked from commit 8de86d0a843c32ca9d36864bdb92f0376a830bce)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe/eustall: Fix drm_dev_put called before stream disable in close

2026-05-23T11:09:34+00:00

[ Upstream commit dc2d9842c67d883d3200ae33b9c3859dd9492408 ]

In xe_eu_stall_stream_close(), drm_dev_put() is called before the
stream is disabled and its resources are freed. If this drops the
last reference, the device structures could be freed while the
subsequent cleanup code still accesses them, leading to a
use-after-free.

Fix this by moving drm_dev_put() after all device accesses are
complete. This matches the ordering in xe_oa_release().
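
A small model of the ordering problem, with hypothetical names throughout. To keep the sketch free of real undefined behaviour, "freed" is tracked as a flag instead of actually releasing memory.

```c
#include <assert.h>

/* Hypothetical refcounted device. */
struct demo_dev {
	int refcount;
	int freed;
	int stream_enabled;
};

static void demo_dev_put(struct demo_dev *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->freed = 1;
}

/* Returns -1 if it would have touched an already freed device. */
static int demo_stream_disable(struct demo_dev *dev)
{
	if (dev->freed)
		return -1;
	dev->stream_enabled = 0;
	return 0;
}

/* Fixed order: finish every device access, then drop the reference. */
static int demo_close_fixed(struct demo_dev *dev)
{
	int ret = demo_stream_disable(dev);

	demo_dev_put(dev);
	return ret;
}

/* Old order: the put may drop the last reference before the cleanup. */
static int demo_close_buggy(struct demo_dev *dev)
{
	demo_dev_put(dev);
	return demo_stream_disable(dev);
}
```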

Fixes: 9a0b11d4cf3b ("drm/xe/eustall: Add support to init, enable and disable EU stall sampling")
Cc: Harish Chegondi 
Assisted-by: Claude:claude-opus-4.6
Signed-off-by: Shuicheng Lin 
Reviewed-by: Harish Chegondi 
Link: https://patch.msgid.link/20260415225428.3399934-1-shuicheng.lin@intel.com
Signed-off-by: Matt Roper 
(cherry picked from commit 35aff528f7297e949e5e19c9cd7fd748cf1cf21c)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe: Fix error cleanup in xe_exec_queue_create_ioctl()

2026-05-23T11:09:34+00:00

[ Upstream commit f3cc22d4df3ed58439ea7e21daa54c3608e03b78 ]

Two error handling issues exist in xe_exec_queue_create_ioctl():

1. When xe_hw_engine_group_add_exec_queue() fails, the error path jumps
   to put_exec_queue which skips xe_exec_queue_kill(). If the VM is in
   preempt fence mode, xe_vm_add_compute_exec_queue() has already added
   the queue to the VM's compute exec queue list. Skipping the kill
   leaves the queue on that list, leading to a dangling pointer after
   the queue is freed.

2. When xa_alloc() fails after xe_hw_engine_group_add_exec_queue() has
   succeeded, the error path does not call
   xe_hw_engine_group_del_exec_queue() to remove the queue from the hw
   engine group list. The queue is then freed while still linked into
   the hw engine group, causing a use-after-free.

Fix both by:
- Changing the xe_hw_engine_group_add_exec_queue() failure path to jump
  to kill_exec_queue so that xe_exec_queue_kill() properly removes the
  queue from the VM's compute list.
- Adding a del_hw_engine_group label before kill_exec_queue for the
  xa_alloc() failure path, which removes the queue from the hw engine
  group before proceeding with the rest of the cleanup.
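
The resulting unwind order can be sketched like this. The struct, the function and the error codes are hypothetical; only the label ordering mirrors the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical queue state: which lists the queue is linked into. */
struct demo_queue {
	int on_vm_list;
	int in_hw_group;
};

static int demo_create(struct demo_queue *q, int group_add_fails,
		       int id_alloc_fails)
{
	int err;

	assert(q != NULL);

	q->on_vm_list = 1;		/* step 1: added to the VM's list */

	if (group_add_fails) {
		err = -ENOMEM;
		goto kill_queue;	/* step 1 must still be undone */
	}
	q->in_hw_group = 1;		/* step 2: added to the hw engine group */

	if (id_alloc_fails) {
		err = -ENOSPC;
		goto del_hw_group;	/* undo step 2, then fall through */
	}

	return 0;

del_hw_group:
	q->in_hw_group = 0;
kill_queue:
	q->on_vm_list = 0;
	return err;
}
```

Labels are laid out in reverse order of the setup steps, so each failure point jumps to the label that undoes everything completed so far.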

Fixes: 7970cb36966c ("drm/xe/hw_engine_group: Register hw engine group's exec queues")
Cc: Francois Dugast 
Cc: Matthew Brost 
Cc: Niranjana Vishwanathapura 
Assisted-by: Claude:claude-opus-4.6
Reviewed-by: Matthew Brost 
Link: https://patch.msgid.link/20260408020647.3397933-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin 
(cherry picked from commit 37c831f401746a45d510b312b0ed7a77b1e06ec8)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe: Fix potential NULL deref in xe_exec_queue_tlb_inval_last_fence_put_unlocked

2026-05-23T11:09:34+00:00

[ Upstream commit f8c4151d50b12923b67819ebf03c1c6782c984c1 ]

xe_exec_queue_tlb_inval_last_fence_put_unlocked() uses q->vm->xe as the
first argument to xe_assert(). This function is called unconditionally
from xe_exec_queue_destroy() for all queues, including kernel queues
that have q->vm == NULL (e.g., queues created during GT init in
xe_gt_record_default_lrcs() with vm=NULL).

While current compilers optimize away the q->vm->xe dereference (even
in CONFIG_DRM_XE_DEBUG=y builds, the compiler pushes the dereference
into the WARN branch that is only taken when the assert condition is
false), the code is semantically incorrect and constitutes undefined
behavior in the C abstract machine for the NULL pointer case.

Use gt_to_xe(q->gt) instead, which is always valid for any exec queue.
This is consistent with how xe_exec_queue_destroy() itself obtains the
xe_device pointer in its own xe_assert at the top of the function.
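
As a rough illustration with hypothetical types (not the xe structures), the device can be reached through the member that is always set, so a queue without a VM never has its NULL vm pointer dereferenced:

```c
#include <assert.h>
#include <stddef.h>

struct demo_device {
	int id;
};

struct demo_vm {
	struct demo_device *xe;
};

struct demo_gt {
	struct demo_device *xe;
};

/* vm may be NULL for kernel queues; gt is assumed to be always set. */
struct demo_queue {
	struct demo_vm *vm;
	struct demo_gt *gt;
};

/* Go through the GT instead of q->vm->xe. */
static struct demo_device *demo_queue_to_device(const struct demo_queue *q)
{
	assert(q->gt != NULL);
	return q->gt->xe;
}
```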

Fixes: b2d7ec41f2a3 ("drm/xe: Attach last fence to TLB invalidation job queues")
Assisted-by: Claude:claude-opus-4.6
Reviewed-by: Matthew Brost 
Link: https://patch.msgid.link/20260409003449.3405767-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin 
(cherry picked from commit 96078a1c68bf97f17fd1d08c3f58f5c5cc9ccd65)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe/debugfs: Correct printing of register whitelist ranges

2026-05-23T11:09:34+00:00

[ Upstream commit 03f2499c51dffce611b065b2894406beb9f2ebe0 ]

The register-save-restore debugfs prints whitelist entries as offset
ranges.  E.g.,

        REG[0x39319c-0x39319f]: allow read access

for a single dword-sized register.  However the GENMASK value used to
set the lower bits to '1' for the upper bound of the whitelist range
incorrectly included one more bit than it should have, causing the
whitelist ranges to sometimes appear twice as large as they really were.
For example,

        REG[0x6210-0x6217]: allow rw access

was also intended to be a single dword-sized register whitelist (with a
range 0x6210-0x6213) but was printed incorrectly as a qword-sized range
because one too many bits was flipped on.  Similar 'off by one' logic
was applied when printing 4-dword register ranges and 64-dword register
ranges as well.

Correct the GENMASK logic to print these ranges in debugfs correctly.
No impact outside of correcting the misleading debugfs output.
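
The arithmetic behind the off-by-one, as a userspace sketch. DEMO_GENMASK is a local stand-in for the kernel's GENMASK(h, l), and both helper functions are hypothetical; the real debugfs code is structured differently.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit stand-in for GENMASK(h, l): bits h..l set, inclusive. */
#define DEMO_GENMASK(h, l) \
	((~UINT32_C(0) >> (31 - (h))) & (~UINT32_C(0) << (l)))

/*
 * A range of 2^size_log2 bytes covers the low size_log2 bits, which
 * are bits (size_log2 - 1)..0. Requires size_log2 >= 1.
 */
static uint32_t demo_range_end(uint32_t start, unsigned int size_log2)
{
	assert(size_log2 >= 1 && size_log2 <= 32);
	return start | DEMO_GENMASK(size_log2 - 1, 0);
}

/* Off by one: sets bits size_log2..0, doubling the printed range. */
static uint32_t demo_range_end_buggy(uint32_t start, unsigned int size_log2)
{
	assert(size_log2 <= 31);
	return start | DEMO_GENMASK(size_log2, 0);
}
```

With a dword being 4 bytes (size_log2 = 2), demo_range_end(0x6210, 2) gives 0x6213 while the buggy variant gives 0x6217. The same reasoning gives size_log2 = 4 for a 4-dword range and 8 for a 64-dword range.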

Fixes: d855d2246ea6 ("drm/xe: Print whitelist while applying")
Reviewed-by: Stuart Summers 
Link: https://patch.msgid.link/20260408-regsr_wl_range-v1-1-e9a28c8b4264@intel.com
Signed-off-by: Matt Roper 
(cherry picked from commit 1a2a722ff96749734a5585dfe7f0bea7719caa8b)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe: Drop registration of guc_submit_wedged_fini from xe_guc_submit_wedge()

2026-05-23T11:09:34+00:00

[ Upstream commit a0fc362f095330f7b3f68ac0c55ef8da18290c87 ]

xe_guc_submit_wedge() runs in the DMA-fence signaling path, where
GFP_KERNEL memory allocations are not permitted. However, registering
guc_submit_wedged_fini via drmm_add_action_or_reset() triggers such an
allocation.

Avoid this by moving the logic from guc_submit_wedged_fini() into
guc_submit_fini(), where wedged exec queue references are dropped during
normal teardown.
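
A simplified model of the split, assuming hypothetical demo_* names: the wedge step only flips state and takes references, and the references are dropped by the teardown function that already exists, so nothing has to be registered (and therefore allocated) at wedge time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exec queue with a plain integer refcount. */
struct demo_queue {
	int refcount;
	int wedged;
};

/* Signaling-path side: no allocation, only state and references. */
static void demo_submit_wedge(struct demo_queue *queues, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		queues[i].refcount++;
		queues[i].wedged = 1;
	}
}

/* Normal teardown: drop the references taken for wedged queues. */
static void demo_submit_fini(struct demo_queue *queues, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!queues[i].wedged)
			continue;
		assert(queues[i].refcount > 0);
		queues[i].refcount--;
		queues[i].wedged = 0;
	}
}
```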

Fixes: 8ed9aaae39f3 ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
Signed-off-by: Matthew Brost 
Reviewed-by: Rodrigo Vivi 
Link: https://patch.msgid.link/20260326210116.202585-3-matthew.brost@intel.com
(cherry picked from commit 4a706bd93c4fb156a13477e26ffdf2e633edeb10)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe: Use XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET enum instead of magic number

2026-05-23T11:09:34+00:00

[ Upstream commit a7f607610da721f77db358b09be8091e60bd8e89 ]

Replace the magic number 2 with the proper enum value
XE_WEDGED_MODE_UPON_ANY_HANG_NO_RESET for better code readability
and maintainability.

Signed-off-by: Zhanjun Dong 
Reviewed-by: Matthew Brost 
Signed-off-by: Matthew Brost 
Link: https://patch.msgid.link/20260310225039.1320161-5-zhanjun.dong@intel.com
Stable-dep-of: a0fc362f0953 ("drm/xe: Drop registration of guc_submit_wedged_fini from xe_guc_submit_wedge()")
Signed-off-by: Sasha Levin