linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c, branch linux-6.10.y

drm/amdgpu: Fix two reset triggered in a row

2024-09-12T09:13:05+00:00

[ Upstream commit f4322b9f8ad5f9f62add288c785d2e10bb6a5efe ]

Sometimes a hung GPU causes multiple reset sources to schedule resets.
The second source can trigger an unnecessary reset if it schedules
after we call amdgpu_device_stop_pending_resets.

Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.

Remove the unnecessary and incorrect checks for amdgpu_in_reset that were
loosely serving this purpose.

Signed-off-by: Yunxiang Li 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher 
Stable-dep-of: 6e4aa08fa9c6 ("drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic")
Signed-off-by: Sasha Levin
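
The ordering change described in "Fix two reset triggered in a row" can be illustrated with a small userspace model. This is only a sketch: struct mock_dev, schedule_reset(), recover() and simulate() are hypothetical names, and stop_pending_resets() merely stands in for amdgpu_device_stop_pending_resets; none of this is the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not the real amdgpu_device. */
struct mock_dev {
    bool gpu_hung;       /* true while the hardware needs a reset */
    int pending_resets;  /* reset work items queued by reset sources */
    int resets_done;     /* how many resets actually ran */
};

/* A reset source queues a reset work item. */
static void schedule_reset(struct mock_dev *dev)
{
    dev->pending_resets++;
}

/* Stands in for amdgpu_device_stop_pending_resets(): drop queued work. */
static void stop_pending_resets(struct mock_dev *dev)
{
    dev->pending_resets = 0;
}

/*
 * Run one recovery. `late_source` models a second source that reacts to
 * the same hang while this recovery is already in progress.
 */
static void recover(struct mock_dev *dev, bool cancel_after_reset,
                    bool late_source)
{
    if (!cancel_after_reset)
        stop_pending_resets(dev);   /* old order: cancel before the reset */

    if (late_source)
        schedule_reset(dev);        /* duplicate request for the same hang */

    dev->gpu_hung = false;          /* the reset itself */
    dev->resets_done++;

    if (cancel_after_reset)
        stop_pending_resets(dev);   /* new order: cancel once the GPU is good */
}

/* Drain the queue the way a worker would and return the number of resets. */
static int simulate(bool cancel_after_reset)
{
    struct mock_dev dev = { .gpu_hung = true };

    schedule_reset(&dev);             /* first source */
    dev.pending_resets--;             /* worker picks the item up */
    recover(&dev, cancel_after_reset, true);

    while (dev.pending_resets > 0) {  /* leftover work triggers another reset */
        dev.pending_resets--;
        recover(&dev, cancel_after_reset, false);
    }

    assert(!dev.gpu_hung);
    return dev.resets_done;
}
```

In this model, cancelling before the reset lets the late request survive and run a second reset, while cancelling after the reset drops it.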

drm/amdgpu: Set no_hw_access when VF request full GPU fails

2024-09-12T09:13:01+00:00

[ Upstream commit 33f23fc3155b13c4a96d94a0a22dc26db767440b ]

[Why]
If a VF requests full GPU access and the request fails,
the VF driver can get stuck accessing registers for an extended period during
the unload of KMS.

[How]
Set the no_hw_access flag when the VF request for full GPU access fails.
This prevents further hardware access attempts, avoiding the prolonged
stuck state.

Signed-off-by: Yifan Zha 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin
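
A minimal userspace sketch of the idea in "Set no_hw_access when VF request full GPU fails" follows. Only the flag name no_hw_access is taken from the commit message; struct mock_dev, request_full_gpu(), reg_read() and unload() are assumptions made for illustration and do not mirror the driver's real functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mock of the device; only the flag name comes from the commit. */
struct mock_dev {
    bool no_hw_access;     /* once set, register paths must not touch hardware */
    bool host_grants;      /* test knob: does the host grant full access? */
    unsigned hw_touches;   /* counts simulated register accesses */
};

/* Assumed helper standing in for the request to the host; 0 on success. */
static int request_full_gpu(struct mock_dev *dev)
{
    return dev->host_grants ? 0 : -1;
}

/* Register read that honours the flag instead of touching the hardware. */
static uint32_t reg_read(struct mock_dev *dev, uint32_t offset)
{
    if (dev->no_hw_access)
        return 0;
    dev->hw_touches++;
    return offset ^ 0xA5A5A5A5u;  /* arbitrary fake register contents */
}

/* Unload path: on a failed request, mark the hardware as off limits. */
static int unload(struct mock_dev *dev)
{
    int r;

    assert(dev);
    r = request_full_gpu(dev);
    if (r)
        dev->no_hw_access = true;

    for (uint32_t off = 0; off < 4; off++)
        (void)reg_read(dev, off);   /* teardown register traffic */

    return r;
}
```

When the request fails, the teardown loop still runs, but every register access returns early instead of waiting on hardware the VF does not own.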

drm/amdgpu: add skip_hw_access checks for sriov

2024-09-08T05:56:38+00:00

[ Upstream commit b3948ad1ac582f560e1f3aeaecf384619921c48d ]

Accessing registers via host is missing the check for skip_hw_access and
the lockdep check that comes with it.

Signed-off-by: Yunxiang Li 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdgpu: Queue KFD reset workitem in VF FED

2024-09-08T05:56:31+00:00

[ Upstream commit 5434bc03f52de2ec57d6ce684b1853928f508cbc ]

The guest recovery sequence is buggy in Fatal Error when both
FLR and KFD reset workitems are queued at the same time. In addition, the
FLR guest recovery sequence is out of order when PF/VF communication
breaks due to a GPU fatal error.

As a temporary workaround, perform a KFD-style reset (initiate the reset
request from the guest) inside the pf2vf thread on FED.

Signed-off-by: Victor Skvortsov 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdgpu: avoid reading vf2pf info size from FB

2024-09-08T05:56:21+00:00

[ Upstream commit 3bcc0ee14768d886cedff65da72d83d375a31a56 ]

VF can't access FB when host is doing mode1 reset. Use sizeof to get the
vf2pf info size instead of reading it from the vf2pf header stored in FB.

Signed-off-by: Zhigang Luo 
Reviewed-by: Hawking Zhang 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin
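
The difference behind "avoid reading vf2pf info size from FB" can be sketched in plain C. The message layout below is hypothetical and much smaller than the real one, and modelling an inaccessible FB as memory that reads back all ones is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced message layout; the real one differs. */
struct msg_header {
    uint32_t size;
    uint32_t version;
};

struct vf2pf_msg {
    struct msg_header header;
    uint32_t checksum;
    uint32_t payload[4];
};

/* Assumption for this sketch: an unreadable FB returns all ones. */
static void simulate_unreadable_fb(struct vf2pf_msg *shared)
{
    assert(shared);
    memset(shared, 0xFF, sizeof(*shared));
}

/* Old approach: trust a length stored in the shared region. */
static size_t len_from_header(const struct vf2pf_msg *shared)
{
    return shared->header.size;
}

/* New approach: the length is a compile-time property of the type. */
static size_t len_from_type(const struct vf2pf_msg *shared)
{
    return sizeof(*shared);   /* sizeof does not dereference the pointer */
}

/* Returns 1 when a header-derived length would overrun the real object. */
static int header_len_is_unsafe(const struct vf2pf_msg *shared)
{
    return len_from_header(shared) > len_from_type(shared);
}
```

Because sizeof is resolved by the compiler, len_from_type() gives the same answer whether or not the shared region is readable at that moment.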

drm/amdgpu: fix uninitialized scalar variable warning

2024-09-08T05:56:21+00:00

[ Upstream commit 0fa4c25db8b791f79bc0d5a0cd58aff9ad85186b ]

Clear warning that field bp is uninitialized when
calling amdgpu_virt_ras_add_bps.

Signed-off-by: Tim Huang 
Reviewed-by: Yang Wang 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdgpu: Add lock around VF RLCG interface

2024-08-14T13:34:14+00:00

[ Upstream commit e864180ee49b4d30e640fd1e1d852b86411420c9 ]

flush_gpu_tlb may be called from another thread while
device_gpu_recover is running.

Both of these threads access registers through the VF
RLCG interface during VF Full Access. Add a lock around this interface
to prevent race conditions between these threads.

Signed-off-by: Victor Skvortsov 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin
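
The race described in "Add lock around VF RLCG interface" arises whenever a register access is a multi-step handshake shared by two threads. The sketch below is a userspace model under stated assumptions: the two-step "latch address, then read data" mailbox, struct mock_rlcg and all function names are hypothetical, and a pthread mutex is used as a stand-in because the commit message does not say which kernel lock primitive was chosen.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mailbox: an access is "latch address, then read data". */
struct mock_rlcg {
    pthread_mutex_t lock;
    volatile uint32_t addr;   /* scratch register holding the target offset */
};

static struct mock_rlcg rlcg = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Fake firmware: the data returned depends on the currently latched address. */
static uint32_t fw_read_latched(void)
{
    return rlcg.addr * 2u + 1u;
}

/* The whole handshake runs under one lock so callers cannot interleave. */
static uint32_t rlcg_read(uint32_t offset)
{
    uint32_t val;

    pthread_mutex_lock(&rlcg.lock);
    rlcg.addr = offset;          /* step 1: latch the address */
    val = fw_read_latched();     /* step 2: fetch the data for that address */
    pthread_mutex_unlock(&rlcg.lock);
    return val;
}

struct worker_arg {
    uint32_t offset;
    int iters;
    int mismatches;
};

static void *worker(void *p)
{
    struct worker_arg *a = p;

    for (int i = 0; i < a->iters; i++)
        if (rlcg_read(a->offset) != a->offset * 2u + 1u)
            a->mismatches++;
    return NULL;
}

/* Two threads stand in for flush_gpu_tlb and the recovery path. */
static int run_two_threads(int iters)
{
    struct worker_arg a = { .offset = 0x10, .iters = iters };
    struct worker_arg b = { .offset = 0x20, .iters = iters };
    pthread_t ta, tb;
    int ra, rb;

    ra = pthread_create(&ta, NULL, worker, &a);
    assert(ra == 0);
    (void)ra;
    rb = pthread_create(&tb, NULL, worker, &b);
    assert(rb == 0);
    (void)rb;

    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return a.mismatches + b.mismatches;
}
```

With the lock held across both steps, neither thread can observe data fetched for the other thread's address, so run_two_threads() reports no mismatches. Without the lock, step 1 of one thread could land between steps 1 and 2 of the other.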

amd/amdgpu: improve VF recover time

2024-04-10T02:14:30+00:00

1. Change AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT from 30 to 5.
2. Set the fatal error detected flag.

Signed-off-by: Zhigang Luo 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher

drm/amd/amdgpu: support MES command SET_HW_RESOURCE1 in sriov

2024-04-10T02:08:53+00:00

Support the MES command SET_HW_RESOURCE1 in sriov.

Signed-off-by: chongli2 
Reviewed-by: Jingwen Chen 
Acked-by: Jingwen Chen 
Signed-off-by: Alex Deucher

drm/amdgpu: use vm_update_mode=0 as default in sriov for gfx10.3 onwards

2024-04-10T02:02:37+00:00

Apply this rule to all newer ASICs in the sriov case.
For ASICs with VF MMIO access protection, avoid using the CPU for VM table updates.
CPU page table updates have issues with HDP flush because VF MMIO access protection
blocks writes to the BIF_BX_DEV0_EPF0_VF0_HDP_MEM_COHERENCY_FLUSH_CNTL register
during sriov runtime.
Move the check to amdgpu_device_init() to ensure it is done after
amdgpu_device_ip_early_init(), where the IP versions are discovered.

Signed-off-by: Danijel Slivka 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher
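
The selection rule in "use vm_update_mode=0 as default in sriov for gfx10.3 onwards" can be sketched as a pure function that runs once the GC IP version is known. Everything here is illustrative: MOCK_IP_VERSION, MODE_UNSET, struct mock_dev and pick_vm_update_mode() are hypothetical names, the sketch assumes mode 0 means the CPU is never used for VM updates, and it assumes an explicit user setting is left alone, which the commit message does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing of major.minor.revision into one comparable integer. */
#define MOCK_IP_VERSION(mj, mn, rv) \
    (((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

#define MODE_UNSET (-1)   /* hypothetical sentinel for "user gave no value" */

struct mock_dev {
    bool is_sriov_vf;
    uint32_t gc_version;   /* only meaningful after IP discovery */
};

/*
 * Pick the VM update mode. An explicit request wins (an assumption of this
 * sketch); otherwise SR-IOV VFs on gfx10.3 or newer default to 0, and all
 * other devices fall back to the platform default.
 */
static int pick_vm_update_mode(const struct mock_dev *dev, int requested,
                               int platform_default)
{
    assert(requested >= MODE_UNSET && requested <= 3);

    if (requested != MODE_UNSET)
        return requested;
    if (dev->is_sriov_vf && dev->gc_version >= MOCK_IP_VERSION(10, 3, 0))
        return 0;
    return platform_default;
}
```

Taking the version as an input makes the ordering constraint visible: the function can only give a meaningful answer after discovery has filled in gc_version, which matches the reason given for moving the check.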