linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c, branch v6.0-rc2

drm/amdgpu: Avoid another list of reset devices

2022-08-10T19:07:14+00:00

A list of devices to be reset is already created in
amdgpu_device_gpu_recover function. Creating another list with the
same nodes is incorrect and not supported in list_head. Instead, pass
the device list as part of reset context.

Fixes: 9e08564727fc (drm/amdgpu: Refactor mode2 reset logic for v13.0.2)
Signed-off-by: Lijo Lazar 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: move mes self test after drm sched re-started

2022-07-28T20:28:54+00:00

mes self test rely on vm mapping, move it after
drm sched re-started so that vm mapping can work
during gpu reset.

Signed-off-by: Jack Xiao 
Acked-and-tested-by: Evan Quan 
Signed-off-by: Alex Deucher

drm/amdgpu: drop non-necessary call trace dump

2022-07-28T20:28:54+00:00

This extra call trace dump comes out in every gpu reset.
And it gives people a wrong impression that something
went wrong. Although actually there was not.

Signed-off-by: Evan Quan 
Acked-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: Get rid of amdgpu_job->external_hw_fence

2022-07-18T20:37:25+00:00

This is a follow-up cleanup to [1]. See bellow refcount balancing
for calling amdgpu_job_submit_direct after this cleanup as far
as I calculated.

amdgpu_fence_emit
	dma_fence_init 1
	dma_fence_get(fence) 2
	rcu_assign_pointer(*ptr, dma_fence_get(fence) 3

---> amdgpu_job_submit_direct completes before fence signaled
			amdgpu_sa_bo_free
				(*sa_bo)->fence = dma_fence_get(fence) 4

			amdgpu_job_free
				dma_fence_put 3

			amdgpu_vcn_enc_get_destroy_msg
				*fence = dma_fence_get(f) 4
				dma_fence_put(f); 3

			amdgpu_vcn_enc_ring_test_ib
				dma_fence_put(fence) 2

			amdgpu_fence_process
				dma_fence_put 1

			amdgpu_sa_bo_remove_locked
				dma_fence_put 0

---> amdgpu_job_submit_direct completes after fence signaled
			amdgpu_fence_process
				dma_fence_put 2

			amdgpu_job_free
				dma_fence_put 1

			amdgpu_vcn_enc_get_destroy_msg
				*fence = dma_fence_get(f) 2
				dma_fence_put(f); 1

			amdgpu_vcn_enc_ring_test_ib
				dma_fence_put(fence) 0

[1] - https://patchwork.kernel.org/project/dri-devel/cover/20220624180955.485440-1-andrey.grodzovsky@amd.com/

Signed-off-by: Andrey Grodzovsky 
Suggested-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: support reset flag set for gpu reset

2022-07-13T15:25:17+00:00

Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.

Signed-off-by: Likun Gao 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: fix documentation warning

2022-06-30T02:03:51+00:00

Fixes this issue:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5094: warning: expecting prototype for amdgpu_device_gpu_recover_imp(). Prototype was for amdgpu_device_gpu_recover() instead

Fixes: cf727044144d ("drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover")
Reviewed-by: Kent Russell 
Reported-by: Stephen Rothwell 
Signed-off-by: Alex Deucher

drm/amdgpu: Fix typos in amdgpu_stop_pending_resets

2022-06-29T21:10:21+00:00

Change amdggpu to amdgpu and pedning to pending

Signed-off-by: Kent Russell 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-28T15:24:41+00:00

Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by increasing refcount each
time emit is called so amdgpu code doesn't need to make workarounds
using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Also since in the previous patch we resumed setting s_fence->parent to NULL
in drm_sched_stop switch to directly checking if job->hw_fence is
signaled to short circuit reset if already signed.

Signed-off-by: Andrey Grodzovsky 
Tested-by: Yiqing Yao 
Acked-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: Prevent race between late signaled fences and GPU reset.

2022-06-28T15:24:24+00:00

Problem:
After we start handling timed out jobs we assume there fences won't be
signaled but we cannot be sure and sometimes they fire late. We need
to prevent concurrent accesses to fence array from
amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process
from a late EOP interrupt.

Fix:
Before accessing fence array in GPU disable EOP interrupt and flush
all pending interrupt handlers for amdgpu device's interrupt line.

v2: Switch from irq_get/put to full enable/disable_irq for amdgpu

Signed-off-by: Andrey Grodzovsky 
Acked-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: fix adev variable used in amdgpu_device_gpu_recover()

2022-06-21T22:17:24+00:00

Use the correct adev variable for the drm_fb_helper in
amdgpu_device_gpu_recover().  Noticed by inspection.

Fixes: 087451f372bf ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")
Reviewed-by: Guchun Chen 
Signed-off-by: Alex Deucher