linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c, branch linux-6.16.y

drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job

2025-08-28T14:34:30+00:00

commit c00d8b79fd2167c6ac65e096619535acdf8678d5 upstream.

The field job->vm is used in function amdgpu_job_run to get the page
table re-generation counter and decide whether the job should be skipped.

Specifically, function amdgpu_vm_generation checks if the VM is valid for this job to use.
For instance, if a gfx job depends on a cancelled sdma job from entity vm->delayed,
then the gfx job should be skipped.

Fixes: 26c95e838e63 ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare")
Signed-off-by: YuanShang 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit ed76936c6b10b547c6df4ca75412331e9ef6d339)
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: move force completion into ring resets

2025-08-15T14:38:43+00:00

[ Upstream commit 2dee58ca471dae05c473270d0fb74efe01a78ccb ]

Move the force completion handling into each ring
reset function so that each engine can determine
whether or not it needs to force completion on the
jobs in the ring.

Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 
Stable-dep-of: 14b2d71a9a24 ("drm/amdgpu/gfx10: fix KGQ reset sequence")
Signed-off-by: Sasha Levin

drm/amdgpu: rework queue reset scheduler interaction

2025-08-15T14:38:43+00:00

[ Upstream commit 821aacb2dcf0d1fbc3c0f7803b6089b01addb8bf ]

Stopping the scheduler for queue reset is generally a good idea because
it prevents any worker from touching the ring buffer.

Reviewed-by: Alex Deucher 
Signed-off-by: Christian König 
Signed-off-by: Alex Deucher 
Stable-dep-of: 14b2d71a9a24 ("drm/amdgpu/gfx10: fix KGQ reset sequence")
Signed-off-by: Sasha Levin

drm/amdgpu: switch job hw_fence to amdgpu_fence

2025-06-18T17:15:26+00:00

Use the amdgpu fence container so we can store additional
data in the fence.  This also fixes the start_time handling
for MCBP since we were casting the fence to an amdgpu_fence
and it wasn't.

Fixes: 3f4c175d62d8 ("drm/amdgpu: MCBP based on DRM scheduler (v9)")
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 
(cherry picked from commit bf1cd14f9e2e1fdf981eed273ddd595863f5288c)
Cc: stable@vger.kernel.org

drm/amdgpu: rework how isolation is enforced v2

2025-03-21T16:16:34+00:00

Limiting the number of available VMIDs to enforce isolation causes some
issues with gang submit and applying certain HW workarounds which
require multiple VMIDs to work correctly.

So instead start to track all submissions to the relevant engines in a
per partition data structure and use the dma_fences of the submissions
to enforce isolation similar to what a VMID limit does.

v2: use ~0l for jobs without isolation to distinct it from kernel
    submissions which uses NULL for the owner. Add some warning when we
    are OOM.

Signed-off-by: Christian König 
Acked-by: Srinivasan Shanmugam 
Signed-off-by: Alex Deucher

drm/amdgpu: Trigger a wedged event for ring reset

2025-03-10T18:56:25+00:00

Instead of only triggering a wedged event for complete GPU resets,
trigger for ring resets. Regardless of the reset, it's useful for
userspace to know that it happened because the kernel will reject
further submissions from that app.

Reviewed-by: Christian König 
Signed-off-by: André Almeida 
Signed-off-by: Alex Deucher

Merge tag 'amd-drm-next-6.15-2025-03-07' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

2025-03-09T23:04:52+00:00

amdgpu:
- Fix spelling typos
- RAS updates
- VCN 5.0.1 updates
- SubVP fixes
- DCN 4.0.1 fixes
- MSO DPCD fixes
- DIO encoder refactor
- PCON fixes
- Misc cleanups
- DMCUB fixes
- USB4 DP fixes
- DM cleanups
- Backlight cleanups and fixes
- Support platform backlight curves
- Misc code cleanups
- SMU 14 fixes
- JPEG 4.0.3 reset updates
- SR-IOV fixes
- SVM fixes
- GC 12 DCC fixes
- DC DCE 6.x fix
- Hiberation fix

amdkfd:
- Fix possible NULL pointer in queue validation
- Remove unnecessary CP domain validation
- SDMA queue reset support
- Add per process flags

radeon:
- Fix spelling typos
- RS400 hyperZ fix

UAPI:
- Add KFD per process flags for setting precision
  Proposed user space: https://github.com/ROCm/ROCR-Runtime/commit/2a64fa5e06e80e0af36df4ce0c76ae52eeec0a9d

Signed-off-by: Dave Airlie 

From: Alex Deucher 
Link: https://patchwork.freedesktop.org/patch/msgid/20250307211051.1880472-1-alexander.deucher@amd.com

drm/amdgpu: Create a debug option to disable ring reset

2025-02-27T21:50:04+00:00

Prior to the addition of ring reset, the debug option
`debug_disable_soft_recovery` could be used to force a full device
reset. Now that we have ring reset, create a debug option to disable
them in amdgpu, forcing the driver to go with the full device
reset path again when both options are combined.

This option is useful for testing and debugging purposes when one wants
to test the full reset from userspace.

Signed-off-by: André Almeida 
Signed-off-by: Alex Deucher

drm/amdgpu: Log after a successful ring reset

2025-02-25T16:44:00+00:00

When a ring reset happens, the kernel log shows only "amdgpu: Starting
 ring reset", but when it finishes nothing appears in the
log. Explicitly write in the log that the reset has finished correctly.

Reviewed-by: Christian König 
Signed-off-by: André Almeida 
Signed-off-by: Alex Deucher

drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

2025-02-25T16:43:59+00:00

This patch updates the `amdgpu_job_timedout` function to check if
the ring is actually guilty of causing the timeout. If not, it
skips error handling and fence completion.

v2: move the is_guilty check down into the queue reset area (Alex)
v3: need to call is_guilty before reset (Alex)
v4: squash in is_guilty logic fixes (Alex)

Signed-off-by: Alex Deucher 
Signed-off-by: Jesse Zhang 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher