linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h, branch v5.13.2

drm/amdgpu: fix send ras disable cmd when asic not support ras

2021-03-24T03:30:12+00:00

    cause:
	It is necessary to send ras disable command to ras-ta during gfx
	block ras later init, because the ras capability is disable read
	from vbios for vega20 gaming, but the ras context is released
	during ras init process, this will cause send ras disable command
	to ras-to failed.
    how:
	Delay releasing ras context, the ras context
	will be released after gfx block later init done.

Changed from V1:
    move release_ras_context into ras_resume

Changed from V2:
    check BIT(UMC) is more reasonable before access eeprom table

Signed-off-by: Stanley.Yang 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: harvest edc status when connected to host via xGMI

2021-03-24T03:00:41+00:00

When connected to a host via xGMI, system fatal errors may trigger
warm reset, driver has no change to query edc status before reset.
Therefore in this case, driver should harvest previous error loging
registers during boot, instead of only resetting them.

v2:
1. IP's ras_manager object is created when its ras feature is enabled,
so change to query edc status after amdgpu_ras_late_init called

2. change to enable watchdog timer after finishing gfx edc init

Signed-off-by: Dennis Li 
Reivewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: remove unnecessary reading for epprom header

2021-02-26T22:23:49+00:00

If the number of badpage records exceed the threshold, driver has
updated both epprom header and control->tbl_hdr.header before gpu reset,
therefore GPU recovery thread no need to read epprom header directly.

v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold

Signed-off-by: Dennis Li 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: do not keep debugfs dentry

2021-02-18T21:43:09+00:00

Cleanup unnecessary debugfs dentries and surrounding functions.

v3: remove return value check for debugfs_create_file()
v2: remove ttm_debugfs_entries array.
    do not init variables.

Signed-off-by: Nirmoy Das 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: fix debugfs creation/removal, again

2020-12-09T04:02:05+00:00

There is still a warning when CONFIG_DEBUG_FS is disabled:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function]
 1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)

Change the code again to make the compiler actually drop
this code but not warn about it.

Fixes: ae2bf61ff39e ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS")
Reviewed-by: Tao Zhou 
Signed-off-by: Arnd Bergmann 
Signed-off-by: Alex Deucher

drm/amdgpu: fix the issue of reserving bad pages failed

2020-10-30T04:57:29+00:00

In amdgpu_ras_reset_gpu, because bad pages may not be freed,
it has high probability to reserve bad pages failed.

Change to reserve bad pages when freeing VRAM.

v2:
1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c
2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr
   reserve bad page failed, it will put it into pending list, otherwise
   put it into processed list;
3. remove amdgpu_ras_release_bad_pages, because retired page's info has
   been moved into amdgpu_vram_mgr

v3:
1. formate code style;
2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation;
3. rename scope_pending as reservations_pending;
4. rename scope_processed as reserved_pages;
5. change to iterate over all the pending ones and try to insert them
   with drm_mm_reserve_node();

v4:
1. rename amdgpu_vram_mgr_reserve_scope as
amdgpu_vram_mgr_reserve_range;
2. remove unused include "amdgpu_ras.h";
3. rename amdgpu_vram_mgr_check_and_reserve as
amdgpu_vram_mgr_do_reserve;
4. refine amdgpu_vram_mgr_reserve_range to call
amdgpu_vram_mgr_do_reserve.

Reviewed-by: Christian König 
Reviewed-by: Hawking Zhang 
Signed-off-by: Dennis Li 
Signed-off-by: Wenhui Sheng 
Signed-off-by: Alex Deucher

drm/amdgpu: remove redundant GPU reset

2020-10-30T04:57:23+00:00

Because bad pages saving has been moved to UMC error interrupt callback,
which will trigger a new GPU reset after saving.

Signed-off-by: Dennis Li 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: change to save bad pages in UMC error interrupt callback

2020-10-30T04:57:17+00:00

Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce
the unnecessary calling of amdgpu_ras_save_bad_pages.

Signed-off-by: Dennis Li 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: bypass querying ras error count registers

2020-08-14T20:12:22+00:00

Once ras recovery is issued by ras sync flood interrupt or
ras controller interrupt, add this guard to bypass or execute
ras error count register harvest of all IPs.

Signed-off-by: Guchun Chen 
Reviewed-by: Hawking Zhang 
Reviewed-by: Dennis Li 
Signed-off-by: Alex Deucher

drm/amdgpu: break GPU recovery once it's in bad state(v4)

2020-08-04T21:26:54+00:00

When GPU executes recovery and retriving bad GPU tag
from external eerpom device, the recovery will be broken
and error message is printed as well for user's awareness.

v2: Refine warning message in threshold reaching case, and
    fix spelling typo.

v3: Fix explicit calling of bad gpu.

v4: Rename function names.

Signed-off-by: Guchun Chen 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher