linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c, branch v5.15.2

drm/amdgpu: revert "Add autodump debugfs node for gpu reset v8"

2021-11-06T13:13:31+00:00

commit c8365dbda056578eebe164bf110816b1a39b4b7f upstream.

This reverts commit 728e7e0cd61899208e924472b9e641dbeb0775c4.

Further discussion reveals that this feature is severely broken
and needs to be reverted ASAP.

GPU reset can never be delayed by userspace even for debugging or
otherwise we can run into in kernel deadlocks.

Signed-off-by: Christian König 
Acked-by: Alex Deucher 
Acked-by: Nirmoy Das 
Signed-off-by: Alex Deucher 
Signed-off-by: Greg Kroah-Hartman

drm/amdkfd: fix boot failure when iommu is disabled in Picasso.

2021-11-06T13:13:30+00:00

commit afd18180c07026f94a80ff024acef5f4159084a4 upstream.

When IOMMU disabled in sbios and kfd in iommuv2 path, iommuv2
init will fail. But this failure should not block amdgpu driver init.

Reported-by: youling 
Tested-by: youling 
Signed-off-by: Yifan Zhang 
Reviewed-by: James Zhu 
Signed-off-by: Alex Deucher 
Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume

2021-10-05T17:02:31+00:00

In current code, when a PCI error state pci_channel_io_normal is detectd,
it will report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI
driver will continue the execution of PCI resume callback report_resume by
pci_walk_bridge, and the callback will go into amdgpu_pci_resume
finally, where write lock is releasd unconditionally without acquiring
such lock first. In this case, a deadlock will happen when other threads
start to acquire the read lock.

To fix this, add a member in amdgpu_device strucutre to cache
pci_channel_state, and only continue the execution in amdgpu_pci_resume
when it's pci_channel_io_frozen.

Fixes: c9a6b82f45e2 ("drm/amdgpu: Implement DPC recovery")
Suggested-by: Andrey Grodzovsky 
Signed-off-by: Guchun Chen 
Reviewed-by: Andrey Grodzovsky 
Signed-off-by: Alex Deucher

drm/amdgpu: init iommu after amdkfd device init

2021-10-05T17:02:20+00:00

This patch is to fix clinfo failure in Raven/Picasso:

Number of platforms: 1
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.2 AMD-APP (3364.0)
  Platform Name: AMD Accelerated Parallel Processing
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions: cl_khr_icd cl_amd_event_callback

  Platform Name: AMD Accelerated Parallel Processing Number of devices: 0

Signed-off-by: Yifan Zhang 
Reviewed-by: James Zhu 
Tested-by: James Zhu 
Acked-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdgpu: move iommu_resume before ip init/resume

2021-09-16T13:56:24+00:00

Separate iommu_resume from kfd_resume, and move it before
other amdgpu ip init/resume.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211277
Signed-off-by: James Zhu 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org

drm/amdgpu: Cancel delayed work when GFXOFF is disabled

2021-08-20T16:09:44+00:00

schedule_delayed_work does not push back the work if it was already
scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
was disabled and re-enabled again during those 100 ms.

This resulted in frame drops / stutter with the upcoming mutter 41
release on Navi 14, due to constantly enabling GFXOFF in the HW and
disabling it again (for getting the GPU clock counter).

To fix this, call cancel_delayed_work_sync when the disable count
transitions from 0 to 1, and only schedule the delayed work on the
reverse transition, not if the disable count was already 0. This makes
sure the delayed work doesn't run at unexpected times, and allows it to
be lock-free.

v2:
* Use cancel_delayed_work_sync & mutex_trylock instead of
  mod_delayed_work.
v3:
* Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
v4:
* Fix race condition between amdgpu_gfx_off_ctrl incrementing
  adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
  checking for it to be 0 (Evan Quan)

Cc: stable@vger.kernel.org
Reviewed-by: Evan Quan 
Reviewed-by: Lijo Lazar  # v3
Acked-by: Christian König  # v3
Signed-off-by: Michel Dänzer 
Signed-off-by: Alex Deucher

drm/amd/amdgpu:flush ttm delayed work before cancel_sync

2021-08-18T22:23:00+00:00

[Why]
In some cases when we unload driver, warning call trace
will show up in vram_mgr_fini which claims that LRU is not empty, caused
by the ttm bo inside delay deleted queue.

[How]
We should flush delayed work to make sure the delay deleting is done.

Signed-off-by: YuBiao Wang 
Reviewed-by: Andrey Grodzovsky 
Signed-off-by: Alex Deucher

drm/amd/amdgpu embed hw_fence into amdgpu_job

2021-08-16T19:16:58+00:00

Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.

How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling

Signed-off-by: Jingwen Chen 
Signed-off-by: Jack Zhang 
Reviewed-by: Andrey Grodzovsky 
Reviewed by: Monk Liu 
Signed-off-by: Alex Deucher

drm/amd/amdgpu: skip locking delayed work if not initialized.

2021-08-09T19:42:30+00:00

When init failed in early init stage, amdgpu_object has
not been initialized, so hasn't the ttm delayed queue functions.

Signed-off-by: YuBiao Wang 
Reviewed-by: Emily.Deng 
Signed-off-by: Alex Deucher

drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-06T01:17:58+00:00

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
scheduler in s3 test, otherwise, fence related failure will arrive
after resume. To fix this and for a better clean up, move drm_sched_fini
from fence_hw_fini to fence_sw_fini, as it's part of driver shutdown, and
should never be called in hw_fini.

v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
to keep sw_init and sw_fini paired.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1668
Fixes: 8d35a2596164c1 ("drm/amdgpu: adjust fence driver enable sequence")
Suggested-by: Christian König 
Tested-by: Mike Lothian 
Signed-off-by: Guchun Chen 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher