linux-stable.git/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c, branch linux-6.10.y

drm/amdgpu: Add reset_context flag for host FLR

2024-09-12T09:13:05+00:00

[ Upstream commit 25c01191c2555351922e5515b6b6d31357975031 ]

There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.

Add a flag in reset_context to explicitly mark a host-triggered reset, and
set this flag when we receive the host reset notification.

Signed-off-by: Yunxiang Li 
Reviewed-by: Emily Deng 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher 
Stable-dep-of: 6e4aa08fa9c6 ("drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic")
Signed-off-by: Sasha Levin
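
The idea in the commit above can be sketched in plain userspace C. This is
an illustration only, not the driver code: every name prefixed with sketch_
or SKETCH_ is hypothetical, and the real reset_context layout and flag names
in the kernel may differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's real names and values may differ. */
enum sketch_reset_flag {
	SKETCH_NEED_FULL_RESET = 0,
	SKETCH_HOST_FLR = 1,
};

/* Hypothetical stand-in for the driver's reset context. */
struct sketch_reset_context {
	void *job;           /* NULL for several unrelated reset sources */
	unsigned long flags; /* bitmask indexed by enum sketch_reset_flag */
};

static void sketch_set_flag(struct sketch_reset_context *ctx,
			    enum sketch_reset_flag flag)
{
	ctx->flags |= 1UL << flag;
}

static bool sketch_test_flag(const struct sketch_reset_context *ctx,
			     enum sketch_reset_flag flag)
{
	return (ctx->flags & (1UL << flag)) != 0;
}

/* Old heuristic: ambiguous, because other sources also pass a NULL job. */
static bool sketch_is_host_flr_by_job(const struct sketch_reset_context *ctx)
{
	return ctx->job == NULL;
}

/* New check: only true when the notification handler set the flag. */
static bool sketch_is_host_flr(const struct sketch_reset_context *ctx)
{
	return sketch_test_flag(ctx, SKETCH_HOST_FLR);
}
```

With two contexts that both carry a NULL job, the job-based heuristic
reports a host FLR for both, while the flag-based check reports it only for
the context on which sketch_set_flag(ctx, SKETCH_HOST_FLR) was called.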

drm/amdgpu: Fix two reset triggered in a row

2024-09-12T09:13:05+00:00

[ Upstream commit f4322b9f8ad5f9f62add288c785d2e10bb6a5efe ]

Sometimes a hung GPU causes multiple reset sources to schedule resets.
The second source can trigger an unnecessary reset if it schedules its
reset after we call amdgpu_device_stop_pending_resets.

Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.

Remove the unnecessary and incorrect checks for amdgpu_in_reset that were
loosely serving this purpose.

Signed-off-by: Yunxiang Li 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher 
Stable-dep-of: 6e4aa08fa9c6 ("drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic")
Signed-off-by: Sasha Levin
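
The ordering problem described in the commit above can be modelled with a
small userspace sketch. It is a simplified assumption of how the race plays
out, not the driver's implementation: the sketch_ names are hypothetical,
and a second reset source is simulated as firing in the middle of recovery.

```c
#include <stdbool.h>

#define SKETCH_NUM_SOURCES 3

/* Hypothetical device model: one pending bit per reset source. */
struct sketch_dev {
	bool pending[SKETCH_NUM_SOURCES];
	unsigned int resets_done;
};

static void sketch_schedule(struct sketch_dev *dev, int src)
{
	dev->pending[src] = true;
}

/* Models cancelling every reset that is still queued. */
static void sketch_stop_pending(struct sketch_dev *dev)
{
	for (int i = 0; i < SKETCH_NUM_SOURCES; i++)
		dev->pending[i] = false;
}

/*
 * Run one recovery. late_src (>= 0) is a source that notices the same hang
 * while the recovery is in progress. Returns how many resets are still
 * queued afterwards; each of them would reset an already healthy GPU.
 */
static unsigned int sketch_recover(struct sketch_dev *dev,
				   bool cancel_after_reset, int late_src)
{
	unsigned int left = 0;

	if (!cancel_after_reset)
		sketch_stop_pending(dev); /* old order: cancel too early */

	if (late_src >= 0)
		sketch_schedule(dev, late_src);

	dev->resets_done++; /* the actual reset happens here */

	if (cancel_after_reset)
		sketch_stop_pending(dev); /* new order: cancel once done */

	for (int i = 0; i < SKETCH_NUM_SOURCES; i++)
		left += dev->pending[i];
	return left;
}
```

In this model the old order leaves the late request queued, so a second
reset follows, while the new order leaves nothing queued. A reset scheduled
after sketch_recover returns is untouched and treated as legitimate.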

drm/amdgpu: trigger flr_work if reading pf2vf data failed

2024-03-20T17:38:13+00:00

If reading pf2vf data fails 30 times in a row, something is wrong, and
flr_work needs to be triggered to recover from the issue.

Also use dev_err to print the error message so that it identifies which
device has the issue, and add a warning message if waiting for
IDH_FLR_NOTIFICATION_CMPL times out.

Signed-off-by: Zhigang Luo 
Acked-by: Hawking Zhang 
Signed-off-by: Alex Deucher
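
A minimal sketch of the consecutive-failure counter described in the commit
above, written as standalone C. Only the threshold of 30 comes from the
commit message; the struct, the function, and the way the recovery request
is recorded are hypothetical stand-ins for the driver's own bookkeeping.

```c
#include <stdbool.h>

/* Threshold taken from the commit message. */
#define SKETCH_PF2VF_RETRY_LIMIT 30

/* Hypothetical per-device state. */
struct sketch_vf_state {
	unsigned int read_failures; /* consecutive failed pf2vf reads */
	unsigned int flr_requests;  /* times recovery work was requested */
};

/*
 * Record the outcome of one pf2vf read. Returns true when this call
 * crossed the threshold and recovery work should be queued.
 */
static bool sketch_note_pf2vf_read(struct sketch_vf_state *state, bool read_ok)
{
	if (read_ok) {
		state->read_failures = 0; /* only consecutive failures count */
		return false;
	}

	if (++state->read_failures < SKETCH_PF2VF_RETRY_LIMIT)
		return false;

	state->read_failures = 0; /* start counting again after the request */
	state->flr_requests++;
	return true;
}
```

A single successful read resets the counter, so intermittent failures never
reach the threshold; only an unbroken run of 30 failures requests recovery.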

drm/amdgpu: Support passing poison consumption ras block to SRIOV

2024-01-25T19:58:03+00:00

Support passing poison consumption RAS blocks to SRIOV.

Signed-off-by: YiPeng Chai 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: add RAS poison consumption handler for AI SRIOV

2022-12-15T17:18:19+00:00

Send a message to the host, which will handle it.

v2: split the patch into two parts, one is for mxgpu ai and another one
is for common poison consumption handler.

Signed-off-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

Revert "drm/amdgpu: let mode2 reset fallback to default when failure"

2022-10-19T02:08:33+00:00

This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957.

Revert the AMDGPU_SKIP_MODE2_RESET flag, as it conflicts with the original
design of the reset handler. It will be redesigned.

Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher

drm/amdgpu: let mode2 reset fallback to default when failure

2022-08-16T22:14:31+00:00

- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed

v2: move this part out from the asic specific part

Signed-off-by: Victor Zhao 
Acked-by: Andrey Grodzovsky 
Signed-off-by: Alex Deucher

drm/amdgpu: support reset flag set for gpu reset

2022-07-13T15:25:17+00:00

Move reset_context out of the GPU recover function so that it can be
configured for different reset purposes.
For a reset requested through the gpu_recovery sysfs interface, force the
full reset method. Otherwise, try a soft reset by default if the ASIC
supports it, and fall back to a full reset if the soft reset fails.

Signed-off-by: Likun Gao 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher
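
The selection policy in the commit above can be summarized with a short
standalone sketch. The names are hypothetical, and the outcome of the soft
reset attempt is passed in as a parameter rather than performed, so this
shows only the decision logic and not the driver's code.

```c
#include <stdbool.h>

enum sketch_reset_method {
	SKETCH_SOFT_RESET,
	SKETCH_FULL_RESET,
};

/* Hypothetical caller-owned context, built before recovery starts. */
struct sketch_reset_context {
	bool need_full_reset; /* set by callers that must skip soft reset */
};

/*
 * Return the method that ends up being used. asic_has_soft_reset and
 * soft_reset_ok model the ASIC capability and the attempt's outcome.
 */
static enum sketch_reset_method
sketch_pick_reset(const struct sketch_reset_context *ctx,
		  bool asic_has_soft_reset, bool soft_reset_ok)
{
	if (ctx->need_full_reset)
		return SKETCH_FULL_RESET; /* forced by the caller */

	if (asic_has_soft_reset && soft_reset_ok)
		return SKETCH_SOFT_RESET;

	return SKETCH_FULL_RESET; /* unsupported or failed: fall back */
}
```

Because the context is owned by the caller, each reset source can set its
own policy before recovery starts, which is the point of moving it out of
the recover function.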

drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover

2022-06-10T19:26:12+00:00

We removed the wrapper that was queueing the recover function
into the reset domain queue, which was using this name.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: Move in_gpu_reset into reset_domain

2022-02-09T17:17:57+00:00

We should have a single instance per entire reset domain.

Signed-off-by: Andrey Grodzovsky 
Suggested-by: Lijo Lazar 
Reviewed-by: Christian König 
Link: https://www.spinics.net/lists/amd-gfx/msg74116.html