linux.git/drivers/gpu/drm/xe/xe_exec_queue.c, branch v7.1-rc2

drm/xe: Fix error cleanup in xe_exec_queue_create_ioctl()

2026-04-29T16:51:20+00:00

Two error handling issues exist in xe_exec_queue_create_ioctl():

1. When xe_hw_engine_group_add_exec_queue() fails, the error path jumps
   to put_exec_queue which skips xe_exec_queue_kill(). If the VM is in
   preempt fence mode, xe_vm_add_compute_exec_queue() has already added
   the queue to the VM's compute exec queue list. Skipping the kill
   leaves the queue on that list, leading to a dangling pointer after
   the queue is freed.

2. When xa_alloc() fails after xe_hw_engine_group_add_exec_queue() has
   succeeded, the error path does not call
   xe_hw_engine_group_del_exec_queue() to remove the queue from the hw
   engine group list. The queue is then freed while still linked into
   the hw engine group, causing a use-after-free.

Fix both by:
- Changing the xe_hw_engine_group_add_exec_queue() failure path to jump
  to kill_exec_queue so that xe_exec_queue_kill() properly removes the
  queue from the VM's compute list.
- Adding a del_hw_engine_group label before kill_exec_queue for the
  xa_alloc() failure path, which removes the queue from the hw engine
  group before proceeding with the rest of the cleanup.

Fixes: 7970cb36966c ("'drm/xe/hw_engine_group: Register hw engine group's exec queues")
Cc: Francois Dugast 
Cc: Matthew Brost 
Cc: Niranjana Vishwanathapura 
Assisted-by: Claude:claude-opus-4.6
Reviewed-by: Matthew Brost 
Link: https://patch.msgid.link/20260408020647.3397933-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin 
(cherry picked from commit 37c831f401746a45d510b312b0ed7a77b1e06ec8)
Signed-off-by: Rodrigo Vivi

drm/xe: Fix potential NULL deref in xe_exec_queue_tlb_inval_last_fence_put_unlocked

2026-04-29T16:51:19+00:00

xe_exec_queue_tlb_inval_last_fence_put_unlocked() uses q->vm->xe as the
first argument to xe_assert(). This function is called unconditionally
from xe_exec_queue_destroy() for all queues, including kernel queues
that have q->vm == NULL (e.g., queues created during GT init in
xe_gt_record_default_lrcs() with vm=NULL).

While current compilers optimize away the q->vm->xe dereference (even
in CONFIG_DRM_XE_DEBUG=y builds, the compiler pushes the dereference
into the WARN branch that is only taken when the assert condition is
false), the code is semantically incorrect and constitutes undefined
behavior in the C abstract machine for the NULL pointer case.

Use gt_to_xe(q->gt) instead, which is always valid for any exec queue.
This is consistent with how xe_exec_queue_destroy() itself obtains the
xe_device pointer in its own xe_assert at the top of the function.

Fixes: b2d7ec41f2a3 ("drm/xe: Attach last fence to TLB invalidation job queues")
Assisted-by: Claude:claude-opus-4.6
Reviewed-by: Matthew Brost 
Link: https://patch.msgid.link/20260409003449.3405767-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin 
(cherry picked from commit 96078a1c68bf97f17fd1d08c3f58f5c5cc9ccd65)
Signed-off-by: Rodrigo Vivi

Merge drm/drm-next into drm-xe-next

2026-03-12T14:23:23+00:00

Backmerging to bring in 7.00-rc3. Important ahead GPU SVM merging THP
support.

Signed-off-by: Matthew Brost

drm/xe: Allow per queue programming of COMMON_SLICE_CHICKEN3 bit13

2026-03-10T13:45:10+00:00

Similar to i915's commit cebc13de7e704b1355bea208a9f9cdb042c74588
("drm/i915: Whitelist COMMON_SLICE_CHICKEN3 for UMD access"), except
that instead of putting the register on the allowlist for UMD to
program, the KMD is doing the programming at context initialization
based on a queue creation flag.

This is a recommended tuning setting for both gen12 and Xe_HP
platforms.

If a render queue is created with
DRM_XE_EXEC_QUEUE_SET_STATE_CACHE_PERF_FIX, COMMON_SLICE_CHICKEN3 will
be programmed at initialization to enable the render color cache to
key with BTP+BTI (binding table pool + binding table entry) instead of
just BTI (binding table entry). This enables the UMD to avoid emitting
render-target-cache-flush + stall-at-pixel-scoreboard every time a
binding table entry pointing to a render target is changed.

v2: Use xe_lrc_write_ring()

v3: Update xe_query.c to report availability

v4: Rename defines to add DISABLE_

v5: update commit message

v6: rebase

Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/39982

Bspec: 73993, 73994, 72161, 31870, 68331
Acked-by: Rodrigo Vivi 
Reviewed-by: José Roberto de Souza 
Signed-off-by: Lionel Landwerlin 
Signed-off-by: José Roberto de Souza 
Link: https://patch.msgid.link/20260306075504.1288676-1-lionel.g.landwerlin@intel.com

drm/xe: Add missing kernel docs in xe_exec_queue.c

2026-03-04T18:42:43+00:00

Add kernel doc to all exported functions that do not have
a kernel doc in xe_exec_queue.c.

v2: mention multi-lrc in comment (Matt Brost)

Assisted-by: Claude 4.5 Sonnet
Signed-off-by: Niranjana Vishwanathapura 
Reviewed-by: Matthew Brost 
Link: https://patch.msgid.link/20260303223040.140504-2-niranjana.vishwanathapura@intel.com
Link: https://patch.msgid.link/20260303223040.140504-2-niranjana.vishwanathapura@intel.com

Merge tag 'drm-xe-next-2026-03-02' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next

2026-03-03T00:37:29+00:00

UAPI Changes:
- restrict multi-lrc to VCS/VECS engines (Xin Wang)
- Introduce a flag to disallow vm overcommit in fault mode (Thomas)
- update used tracking kernel-doc (Auld, Fixes)
- Some bind queue fixes (Auld, Fixes)

Cross-subsystem Changes:
- Split drm_suballoc_new() into SA alloc and init helpers (Satya, Fixes)
- pass pagemap_addr by reference (Arnd, Fixes)
- Revert "drm/pagemap: Disable device-to-device migration" (Thomas)
- Fix unbalanced unlock in drm_gpusvm_scan_mm (Maciej, Fixes)
- Small GPUSVM fixes (Brost, Fixes)
- Fix xe SVM configs (Thomas, Fixes)

Core Changes:
- Fix a hmm_range_fault() livelock / starvation problem (Thomas, Fixes)

Driver Changes:
- Fix leak on xa_store failure (Shuicheng, Fixes)
- Correct implementation of Wa_16025250150 (Roper, Fixes)
- Refactor context init into xe_lrc_ctx_init (Raag)
- Fix GSC proxy cleanup on early initialization failure (Zhanjun)
- Fix exec queue creation during post-migration recovery (Tomasz, Fixes)
- Apply windower hardware filtering setting on Xe3 and Xe3p (Roper)
- Free ctx_restore_mid_bb in release (Shuicheng, Fixes)
- Drop stale MCR steering TODO comment (Roper)
- dGPU memory optimizations (Brost)
- Do not preempt fence signaling CS instructions (Brost, Fixes)
- Revert "drm/xe/compat: Remove unused i915_reg.h from compat header" (Uma)
- Don't expose display modparam if no display support (Wajdeczko)
- Some VRAM flag improvements (Wajdeczko)
- Misc fix for xe_guc_ct.c (Shuicheng, Fixes)
- Remove unused i915_reg.h from compat header (Uma)
- Workaround cleanup & simplification (Roper)
- Add prefetch pagefault support for Xe3p (Varun)
- Fix fs_reclaim deadlock caused by CCS save/restore (Satya, Fixes)
- Cleanup partially initialized sync on parse failure (Shuicheng, Fixes)
- Allow to change VFs VRAM quota using sysfs (Michal)
- Increase GuC log sizes in debug builds (Tomasz)
- Wa_18041344222 changes (Harish)
- Add Wa_14026781792 (Niton)
- Add debugfs facility to catch RTP mistakes (Roper)
- Convert GT stats to per-cpu counters (Brost)
- Prevent unintended VRAM channel creation (Karthik)
- Privatize struct xe_ggtt (Maarten)
- remove unnecessary struct dram_info forward declaration (Jani)
- pagefault refactors (Brost)
- Apply Wa_14024997852 (Arvind)
- Redirect faults to dummy page for wedged device (Raag, Fixes)
- Force EXEC_QUEUE_FLAG_KERNEL for kernel internal VMs (Piotr)
- Stop applying Wa_16018737384 from Xe3 onward (Roper)
- Add new XeCore fuse registers to VF runtime regs (Roper)
- Update xe_device_declare_wedged() error log (Raag)
- Make xe_modparam.force_vram_bar_size signed (Shuicheng, Fixes)
- Avoid reading media version when media GT is disabled (Piotr, Fixes)
- Fix handling of Wa_14019988906 & Wa_14019877138 (Roper, Fixes)
- Basic enabling patches for Xe3p_LPG and NVL-P (Gustavo, Roper, Shekhar)
- Avoid double-adjust in 64-bit reads (Shuicheng, Fixes)
- Allow VF to initialize MCR tables (Wajdeczko)
- Add Wa_14025883347 for GuC DMA failure on reset (Anirban)
- Add bounds check on pat_index to prevent OOB kernel read in madvise (Jia, Fixes)
- Fix the address range assert in ggtt_get_pte helper (Winiarski)
- XeCore fuse register changes (Roper)
- Add more info to powergate_info debugfs (Vinay)
- Separate out GuC RC code (Vinay)
- Fix g2g_test_array indexing (Pallavi)
- Mutual exclusivity between CCS-mode and PF (Nareshkumar, Fixes)
- Some more _types.h cleanups (Wajdeczko)
- Fix sysfs initialization (Wajdeczko, Fixes)
- Drop unnecessary goto in xe_device_create (Roper)
- Disable D3Cold for BMG only on specific platforms (Karthik, Fixes)
- Add sriov.admin_only_pf attribute (Wajdeczko)
- replace old wq(s), add WQ_PERCPU to alloc_workqueue (Marco)
- Make MMIO communication more robust (Wajdeczko)
- Fix warning of kerneldoc (Shuicheng, Fixes)
- Fix topology query pointer advance (Shuicheng, Fixes)
- use entry_dump callbacks for xe2+ PAT dumps (Xin Wang)
- Fix kernel-doc warning in GuC scheduler ABI header (Chaitanya, Fixes)
- Fix CFI violation in debugfs access (Daniele, Fixes)
- Apply WA_16028005424 to Media (Balasubramani)
- Fix typo in function kernel-doc (Wajdeczko)
- Protect priority against concurrent access (Niranjana)
- Fix nvm aux resource cleanup (Shuicheng, Fixes)
- Fix is_bound() pci_dev lifetime (Shuicheng, Fixes)
- Use CLASS() for forcewake in xe_gt_enable_comp_1wcoh (Shuicheng)
- Reset VF GuC state on fini (Wajdeczko)
- Move _THIS_IP_ usage from xe_vm_create() to dedicated function (Nathan Chancellor, Fixes)
- Unregister drm device on probe error (Shuicheng, Fixes)
- Disable DCC on PTL (Vinay, Fixes)
- Fix Wa_18022495364 (Tvrtko, Fixes)
- Skip address copy for sync-only execs (Shuicheng, Fixes)
- derive mem copy capability from graphics version (Nitin, Fixes)
- Use DRM_BUDDY_CONTIGUOUS_ALLOCATION for contiguous allocations (Sanjay)
- Context based TLB invalidations (Brost)
- Enable multi_queue on xe3p_xpc (Brost, Niranjana)
- Remove check for gt in xe_query (Nakshtra)
- Reduce LRC timestamp stuck message on VFs to notice (Brost, Fixes)

Signed-off-by: Dave Airlie 

From: Matthew Brost 
Link: https://patch.msgid.link/aaYR5G2MHjOEMXPW@lstrano-desk.jf.intel.com

drm/xe/vf: Redo LRC creation while in VF fixups

2026-02-27T17:02:07+00:00

If the xe module within a VM was creating a new LRC during save/
restore, this LRC will be invalid. The fixups procedure may not
be able to reach it, as there will be a race to add the new LRC
reference to an exec queue.

Even if the new LRC which was being created during VM migration is
added to EQ in time for fixups, said LRC may still remain damaged.
In a small percentage of specially crafted test cases, the resulting
LRC was still damaged and caused GPU hang.

Any LRC which could be created in such a situation, have to be
re-created.

Due to VM having arbitrarily set amount of CPU cores, it is possible
to limit the amount to 1. In such case, there is a possibility that
kernel will switch CPU contexts in a way which allows to miss
VF migration recovery running in parallel (by simply not switching
to the LRC creation thread during recovery). Therefore checking
if the migration is in progress just after LRC creation, is not
enough to ensure detection.

Free the incorrectly created LRC, and trigger a re-run of the
creation, but only after waiting for default LRC to get fixups.
Use additional atomic value increased after fixups, to ensure any VF
migration that avoided detection by just checking for recovery in
progress, will be caught.

v2: Merge marker and wait for default LRC, reducing amount of calls
  within xe_init_eq(). Alter the LRC creation loop to remove a race
  with post-migration fixups worker.
v3: Kerneldoc fixes. Rename fixups_complete_count.

Signed-off-by: Tomasz Lis 
Reviewed-by: Matthew Brost 
Signed-off-by: Michal Wajdeczko 
Link: https://patch.msgid.link/20260226212701.2937065-5-tomasz.lis@intel.com

drm/xe: Wrappers for setting and getting LRC references

2026-02-27T17:02:04+00:00

There is a small but non-zero chance that VF post migration fixups
are running on an exec queue during teardown. The chances are
decreased by starting the teardown by releasing guc_id, but remain
non-zero. On the other hand the sync between fixups and EQ creation
(wait_valid_ggtt) drastically increases the chance for such parallel
teardown if queue creation error path is entered (err_lrc label).

The exec queue itself is not going to cause an issue, but LRCs have
a small chance of getting freed during the fixups.

Creating a setter and a getter makes it easier to protect the fixup
operations with a lock. For other driver activities, the original
access method (without any protection) can still be used.

v2: Separate lock, only for LRCs. Kerneldoc fixes. Subject tag fix.

Signed-off-by: Tomasz Lis 
Reviewed-by: Matthew Brost 
Signed-off-by: Michal Wajdeczko 
Link: https://patch.msgid.link/20260226212701.2937065-3-tomasz.lis@intel.com

drm/xe/queue: Call fini on exec queue creation fail

2026-02-27T17:02:03+00:00

Every call to queue init should have a corresponding fini call.
Skipping this would mean skipping removal of the queue from GuC list
(which is part of guc_id allocation). A damaged queue stored in
exec_queue_lookup list would lead to invalid memory reference,
sooner or later.

Call fini to free guc_id. This must be done before any internal
LRCs are freed.

Since the finalization with this extra call became very similar to
__xe_exec_queue_fini(), reuse that. To make this reuse possible,
alter xe_lrc_put() so it can survive NULL parameters, like other
similar functions.

v2: Reuse _xe_exec_queue_fini(). Make xe_lrc_put() aware of NULLs.

Fixes: 3c1fa4aa60b1 ("drm/xe: Move queue init before LRC creation")
Signed-off-by: Tomasz Lis 
Reviewed-by: Matthew Brost  (v1)
Signed-off-by: Michal Wajdeczko 
Link: https://patch.msgid.link/20260226212701.2937065-2-tomasz.lis@intel.com

drm/xe: restrict multi-lrc to VCS/VECS engines

2026-02-27T16:58:02+00:00

Tighten uapi validation to restrict multi-lrc support to VIDEO_DECODE and
VIDEO_ENHANCE engines only. This check should have been in place from the
start, as the driver typically avoids allowing uapi cases that we have
no userspace consumer for.

Additionally, the GuC firmware on ModSched platforms no longer supports
multi-lrc on non-media engines.

V4:
 - use a unified mask for all platforms since engine instance count
   is an independent runtime check (Matt Roper, Matthew Brost)

V3:
 - store a multi-lrc enable class mask in xe->info and populate from
   xe_device_desc in xe_pci.c (Matthew Brost)

V2:
 - correct the typo (Shuicheng)
 - move the check earlier to avoid VM lookup (Shuicheng, Matt Roper)
 - remove the graphics version check (Matt Roper)
 - input more details in the commit info (Matt Roper)

Cc: Shuicheng Lin 
Cc: Matt Roper 
Cc: Matthew Brost 
Signed-off-by: Xin Wang 
Reviewed-by: Shuicheng Lin 
Reviewed-by: Matt Roper 
Link: https://patch.msgid.link/20260225022014.45394-1-x.wang@intel.com
Signed-off-by: Matt Roper