linux-stable.git/drivers/gpu/drm/xe, branch linux-6.16.y

drm/xe: Don't copy pinned kernel bos twice on suspend

2025-10-02T11:48:37+00:00

commit 77c8ede611c6a70a95f7b15648551d0121b40d6c upstream.

We were copying the content of the bos on the list
"xe->pinned.late.kernel_bo_present" twice on suspend.

Presumably the intent is to copy the pinned external bos on
the first pass.

This is harmless since we (currently) should have no pinned
external bos needing copy since
a) external system bos don't have compressed content,
b) We do not (yet) allow pinning of VRAM bos.

Still, fix this up so that we copy pinned external bos on
the first pass. We're about to allow bos pinned in VRAM.

Fixes: c6a4d46ec1d7 ("drm/xe: evict user memory in PM notifier")
Cc: Matthew Auld 
Cc:  # v6.16+
Signed-off-by: Thomas Hellström 
Reviewed-by: Matthew Auld 
Link: https://lore.kernel.org/r/20250918092207.54472-2-thomas.hellstrom@linux.intel.com
(cherry picked from commit 9e69bafece43dcefec864f00b3ec7e088aa7fcbc)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Greg Kroah-Hartman

Revert "drm/xe/guc: Enable extended CAT error reporting"

2025-10-02T11:48:34+00:00

This reverts commit a7ffcea8631af91479cab10aa7fbfd0722f01d9a.

Reported-by: Iyán Méndez Veiga 
Link: https://lore.kernel.org/stable/aNlW7ekiC0dNPxU3@laps/T/#t
Signed-off-by: Sasha Levin

Revert "drm/xe/guc: Set RCS/CCS yield policy"

2025-10-02T11:48:34+00:00

This reverts commit dd1a415dcfd5984bf83abd804c3cd9e0ff9dde30.

Reported-by: Iyán Méndez Veiga 
Link: https://lore.kernel.org/stable/aNlW7ekiC0dNPxU3@laps/T/#t
Signed-off-by: Sasha Levin

drm/xe: Fix build with CONFIG_MODULES=n

2025-10-02T11:48:33+00:00

[ Upstream commit b67e7422d229dead0dddaad7e7c05558f24d552f ]

When building with CONFIG_MODULES=n, the __exit functions are dropped.
However our init functions may call them for error handling, so they are
not good candidates for the exit sections.

Fix this error reported by 0day:

	ld.lld: error: relocation refers to a symbol in a discarded section: xe_configfs_exit
	>>> defined in vmlinux.a(drivers/gpu/drm/xe/xe_configfs.o)
	>>> referenced by xe_module.c
	>>>               drivers/gpu/drm/xe/xe_module.o:(init_funcs) in archive vmlinux.a

This is the only exit function using __exit. Drop it to fix the build.

Cc: Riana Tauro 
Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-kbuild-all/202506092221.1FmUQmI8-lkp@intel.com/
Fixes: 16280ded45fb ("drm/xe: Add configfs to enable survivability mode")
Reviewed-by: Balasubramani Vivekanandan 
Link: https://lore.kernel.org/r/20250912-fix-nomodule-build-v1-1-d11b70a92516@intel.com
Signed-off-by: Lucas De Marchi 
(cherry picked from commit d9b2623319fa20c2206754284291817488329648)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe/vf: Don't expose sysfs attributes not applicable for VFs

2025-10-02T11:48:33+00:00

[ Upstream commit 500dad428e5b0de4c1bdfa893822a6e06ddad0b5 ]

VFs can't read BMG_PCIE_CAP(0x138340) register nor access PCODE
(already guarded by the info.skip_pcode flag) so we shouldn't
expose attributes that require any of them to avoid errors like:

 [] xe 0000:03:00.1: [drm] Tile0: GT0: VF is trying to read an \
                     inaccessible register 0x138340+0x0
 [] RIP: 0010:xe_gt_sriov_vf_read32+0x6c2/0x9a0 [xe]
 [] Call Trace:
 []  xe_mmio_read32+0x110/0x280 [xe]
 []  auto_link_downgrade_capable_show+0x2e/0x70 [xe]
 []  dev_attr_show+0x1a/0x70
 []  sysfs_kf_seq_show+0xaa/0x120
 []  kernfs_seq_show+0x41/0x60

Fixes: 0e414bf7ad01 ("drm/xe: Expose PCIe link downgrade attributes")
Fixes: cdc36b66cd41 ("drm/xe: Expose fan control and voltage regulator version")
Signed-off-by: Michal Wajdeczko 
Cc: Lucas De Marchi 
Cc: Lukasz Laguna 
Reviewed-by: Raag Jadav 
Reviewed-by: Lucas De Marchi 
Link: https://lore.kernel.org/r/20250916170029.3313-2-michal.wajdeczko@intel.com
(cherry picked from commit a2d6223d224f333f705ed8495bf8bebfbc585c35)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe/guc: Set RCS/CCS yield policy

2025-09-25T09:16:52+00:00

[ Upstream commit 26caeae9fb482ec443753b4e3307e5122b60b850 ]

All recent platforms (including all the ones officially supported by the
Xe driver) do not allow concurrent execution of RCS and CCS workloads
from different address spaces, with the HW blocking the context switch
when it detects such a scenario.
The DUAL_QUEUE flag helps with this, by causing the GuC to not submit a
context it knows will not be able to execute. This, however, causes a new
problem: if RCS and CCS queues have pending workloads from different
address spaces, the GuC needs to choose from which of the 2 queues to
pick the next workload to execute. By default, the GuC prioritizes RCS
submissions over CCS ones, which can lead to CCS workloads being
significantly (or completely) starved of execution time.
The driver can tune this by setting a dedicated scheduling policy KLV;
this KLV allows the driver to specify a quantum (in ms) and a ratio
(percentage value between 0 and 100), and the GuC will prioritize the CCS
for that percentage of each quantum.
Given that we want to guarantee enough RCS throughput to avoid missing
frames, we set the yield policy to 20% of each 80ms interval.
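
As a rough illustration of the numbers above, the following sketch works out
how long CCS would be prioritized within each quantum. The helper name
ccs_priority_ms() is hypothetical and is not part of the driver or of the GuC
interface; it only encodes the "ratio percent of each quantum" arithmetic
described in this message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a driver function): number of milliseconds of
 * each quantum during which CCS is prioritized, given a ratio in percent.
 */
uint32_t ccs_priority_ms(uint32_t quantum_ms, uint32_t ratio_percent)
{
	/* The ratio is described as a percentage between 0 and 100. */
	assert(ratio_percent <= 100);

	return quantum_ms * ratio_percent / 100;
}
```

With the values chosen here, ccs_priority_ms(80, 20) gives 16, so under this
reading CCS is prioritized for 16ms of each 80ms interval and the remaining
64ms keep the default RCS-first behaviour.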

v2: updated quantum and ratio, improved comment, use xe_guc_submit_disable
in gt_sanitize

Fixes: d9a1ae0d17bd ("drm/xe/guc: Enable WA_DUAL_QUEUE for newer platforms")
Signed-off-by: Daniele Ceraolo Spurio 
Cc: Matthew Brost 
Cc: John Harrison 
Cc: Vinay Belgaumkar 
Reviewed-by: John Harrison 
Tested-by: Vinay Belgaumkar 
Link: https://lore.kernel.org/r/20250905235632.3333247-2-daniele.ceraolospurio@intel.com
(cherry picked from commit 88434448438e4302e272b2a2b810b42e05ea024b)
Signed-off-by: Rodrigo Vivi 
[Rodrigo added #include "xe_guc_submit.h" while backporting]
Signed-off-by: Sasha Levin

drm/xe/guc: Enable extended CAT error reporting

2025-09-25T09:16:52+00:00

[ Upstream commit a7ffcea8631af91479cab10aa7fbfd0722f01d9a ]

On newer HW (Xe2 onwards + PVC) it is possible to get extra information
when a CAT error occurs, specifically a dword reporting the error type.
To enable this extra reporting, we need to opt-in with the GuC, which is
done via a specific per-VF feature opt-in H2G.

On platforms where the HW does not support the extra reporting, the GuC
will set the type to 0xdeadbeef, so we can keep the code simple and
opt-in to the feature on every platform and then just discard the data
if it is invalid.

Note that on native/PF we're guaranteed that the opt in is available
because we don't support any GuC old enough to not have it, but if we're
a VF we might be running on a non-XE PF with an older GuC, so we need to
handle that case. We can re-use the invalid type above to handle this
scenario the same way as if the feature was not supported in HW.
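
A minimal sketch of the "discard if invalid" approach described above,
assuming only what this message states (the 0xdeadbeef marker). The macro and
function names are hypothetical and do not correspond to the actual driver
code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Marker value from the commit message; the macro name is hypothetical. */
#define CAT_ERROR_TYPE_INVALID 0xdeadbeefu

/*
 * Hypothetical helper: if the opt-in was not accepted (e.g. a VF running on
 * a PF with an older GuC), pretend the GuC reported the invalid marker so
 * that both cases take the same path.
 */
uint32_t effective_cat_error_type(bool opt_in_ok, uint32_t reported_type)
{
	return opt_in_ok ? reported_type : CAT_ERROR_TYPE_INVALID;
}

/* The extra dword is only meaningful when it is not the invalid marker. */
bool cat_error_type_is_valid(uint32_t type)
{
	return type != CAT_ERROR_TYPE_INVALID;
}
```

A caller would then print the error type only when
cat_error_type_is_valid() returns true, and fall back to the plain CAT error
message otherwise.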

Given that this patch is the first user of the guc_buf_cache on native
and VF, it also extends that feature to non-PF use-cases.

v2: simpler print for the error type (John), rebase
v3: use guc_buf_cache instead of new alloc, simpler doc (Michal)

Signed-off-by: Daniele Ceraolo Spurio 
Cc: Nirmoy Das 
Cc: John Harrison 
Cc: Michal Wajdeczko 
Reviewed-by: Nirmoy Das  #v1
Reviewed-by: Michal Wajdeczko 
Reviewed-by: John Harrison 
Link: https://lore.kernel.org/r/20250625205405.1653212-3-daniele.ceraolospurio@intel.com
Stable-dep-of: 26caeae9fb48 ("drm/xe/guc: Set RCS/CCS yield policy")
Signed-off-by: Sasha Levin

drm/xe: Fix error handling if PXP fails to start

2025-09-25T09:16:52+00:00

[ Upstream commit ae5fbbda341f92e605a9508a0fb18456155517f0 ]

Since the PXP start comes after __xe_exec_queue_init() has completed,
we need to clean up what was done in that function in case of a PXP
start error.
__xe_exec_queue_init calls the submission backend init() function,
so we need to introduce an opposite for that. Unfortunately, while
we already have a fini() function pointer, it performs other
operations in addition to cleaning up what was done by the init().
Therefore, for clarity, the existing fini() has been renamed to
destroy(), while a new fini() has been added to only clean up what was
done by the init(), with the latter being called by the former (via
xe_exec_queue_fini).
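
The split described above can be sketched in plain C as follows. This is an
illustration only: struct queue, its fields and all function names are
hypothetical stand-ins, not the driver's real types or ops, and the
"registered" state is invented to represent the extra work that only
destroy() undoes.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an exec queue. */
struct queue {
	bool backend_ready;	/* set by init(), cleared by fini() */
	bool registered;	/* extra state, torn down only by destroy() */
};

/* Backend init: the part that must be undone on a later failure. */
int queue_init(struct queue *q)
{
	q->backend_ready = true;
	return 0;
}

/* New fini(): undoes exactly what init() did, nothing more. */
void queue_fini(struct queue *q)
{
	q->backend_ready = false;
}

/* Former fini(), now destroy(): extra teardown, then fini(). */
void queue_destroy(struct queue *q)
{
	q->registered = false;
	queue_fini(q);
}

/* Creation path: a failure after init() must call fini(). */
int queue_create(struct queue *q, bool pxp_start_fails)
{
	int err = queue_init(q);

	if (err)
		return err;

	if (pxp_start_fails) {
		queue_fini(q);	/* clean up what init() did */
		return -EIO;	/* error code chosen for illustration */
	}

	q->registered = true;
	return 0;
}
```

The point of the split is that the error path can call the narrow fini()
without running teardown steps for state that was never set up.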

Fixes: 72d479601d67 ("drm/xe/pxp/uapi: Add userspace and LRC support for PXP-using queues")
Signed-off-by: Daniele Ceraolo Spurio 
Cc: John Harrison 
Cc: Matthew Brost 
Reviewed-by: John Harrison 
Signed-off-by: John Harrison 
Link: https://lore.kernel.org/r/20250909221240.3711023-3-daniele.ceraolospurio@intel.com
(cherry picked from commit 626667321deb4c7a294725406faa3dd71c3d445d)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe: Fix a NULL vs IS_ERR() in xe_vm_add_compute_exec_queue()

2025-09-25T09:16:52+00:00

[ Upstream commit cbc7f3b4f6ca19320e2eacf8fc1403d6f331ce14 ]

The xe_preempt_fence_create() function returns error pointers.  It
never returns NULL.  Update the error checking to match.
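
To show why a NULL check misses this kind of failure, here is a userspace
sketch that mimics the kernel's error-pointer convention. ERR_PTR(), PTR_ERR()
and IS_ERR() are reimplemented locally for illustration; fence_create() and
the two add_queue_*() functions are hypothetical and are not the driver code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Local userspace imitations of the kernel helpers. */
void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

bool IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int dummy_fence;

/* Hypothetical creator: returns an error pointer on failure, never NULL. */
void *fence_create(bool fail)
{
	return fail ? ERR_PTR(-ENOMEM) : (void *)&dummy_fence;
}

/* Buggy pattern: the NULL check can never trigger. */
int add_queue_buggy(bool fail)
{
	void *fence = fence_create(fail);

	if (!fence)
		return -ENOMEM;

	return 0;	/* an error pointer is treated as a valid fence */
}

/* Fixed pattern: check with IS_ERR() and propagate the error code. */
int add_queue_fixed(bool fail)
{
	void *fence = fence_create(fail);

	if (IS_ERR(fence))
		return (int)PTR_ERR(fence);

	return 0;
}
```

In the buggy variant a failed allocation goes unnoticed and the caller would
go on to use the error pointer as if it were a fence; the fixed variant
returns -ENOMEM instead.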

Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Dan Carpenter 
Reviewed-by: Matthew Brost 
Link: https://lore.kernel.org/r/aJTMBdX97cof_009@stanley.mountain
Signed-off-by: Rodrigo Vivi 
(cherry picked from commit 75cc23ffe5b422bc3cbd5cf0956b8b86e4b0e162)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin

drm/xe/pf: Drop rounddown_pow_of_two fair LMEM limitation

2025-09-25T09:16:51+00:00

[ Upstream commit fef8b64e48e836344574b85132a1c317f4260022 ]

This effectively reverts commit 4c3fe5eae46b ("drm/xe/pf: Limit
fair VF LMEM provisioning") since we don't need it any more after
non-contig VRAM allocations were fixed. This allows larger LMEM
auto-provisioning for VFs, so instead of:

 [ ] GT0: PF: LMEM available(14096M) fair(1 x 8192M)
 [ ] GT0: PF: VF1 provisioned with 8589934592 (8.00 GiB) LMEM
or
 [ ] GT0: PF: LMEM available(14096M) fair(2 x 4096M)
 [ ] GT0: PF: VF1..VF2 provisioned with 4294967296 (4.00 GiB) LMEM

we may get:

 [ ] GT0: PF: LMEM available(14096M) fair(1 x 14096M)
 [ ] GT0: PF: VF1 provisioned with 14780727296 (13.8 GiB) LMEM
and
 [ ] GT0: PF: LMEM available(14096M) fair(2 x 7048M)
 [ ] GT0: PF: VF1..VF2 provisioned with 7390363648 (6.88 GiB) LMEM
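
The figures in the logs above can be reproduced with the small sketch below.
It is a simplified model, assuming the fair share is just the available LMEM
divided by the number of VFs; any alignment or reservation the real
provisioning code applies is ignored. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest power of two that is less than or equal to n (n must be != 0). */
uint64_t rounddown_pow2_u64(uint64_t n)
{
	uint64_t p = 1;

	assert(n != 0);
	while (p <= n / 2)
		p *= 2;

	return p;
}

/*
 * Hypothetical model of the fair per-VF LMEM share, in MiB.
 * limit_pow2 selects the old behaviour (round down to a power of two).
 */
uint64_t fair_lmem_mib(uint64_t available_mib, unsigned int num_vfs,
		       bool limit_pow2)
{
	uint64_t share;

	assert(num_vfs != 0);
	share = available_mib / num_vfs;

	return limit_pow2 ? rounddown_pow2_u64(share) : share;
}
```

With 14096M available, the old behaviour yields 8192M for one VF and 4096M
each for two VFs, while without the limitation the shares are 14096M and
7048M, matching the "fair(...)" values in the logs.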

Fixes: 1e32ffbc9dc8 ("drm/xe/sriov: support non-contig VRAM provisioning")
Signed-off-by: Michal Wajdeczko 
Reviewed-by: Piotr Piórkowski 
Link: https://lore.kernel.org/r/20250910222439.32869-1-michal.wajdeczko@intel.com
(cherry picked from commit 95c1cfa306087142989bff34ea0e05dcd95ddc58)
Signed-off-by: Rodrigo Vivi 
Signed-off-by: Sasha Levin