linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c, branch linux-5.0.y

drm/amdgpu: shadow in shadow_list without tbo.mem.start cause page fault in sriov TDR

2019-05-16T17:40:17+00:00

[ Upstream commit b575f10dbd6f84c2c8744ff1f486bfae1e4f6f38 ]

The shadow is added to shadow_list by amdgpu_bo_create_shadow.
At that point, shadow->tbo.mem is not yet fully configured.
tbo.mem is only fully configured by amdgpu_vm_sdma_map_table, once amdgpu_vm_clear_bo is called.
If an sriov TDR occurs between amdgpu_bo_create_shadow and amdgpu_vm_sdma_map_table,
amdgpu_device_recover_vram handles a shadow without a valid tbo.mem.start, which causes a page fault.
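
To illustrate the hazard, here is a minimal user-space sketch of a recovery
loop that skips list entries whose start offset has not been set yet. The
struct, the INVALID_OFFSET sentinel and the function name are hypothetical
stand-ins, not the driver's types, and this message does not say that the
actual fix takes this form.

```c
#include <stddef.h>

/* Hypothetical sentinel meaning "placement not configured yet". */
#define INVALID_OFFSET (~0UL)

/* Hypothetical stand-in for a shadow entry on shadow_list. */
struct shadow_entry {
	unsigned long start;          /* plays the role of tbo.mem.start */
	struct shadow_entry *next;
};

/* Count the entries that are safe to restore, skipping unconfigured ones. */
static size_t count_restorable(const struct shadow_entry *head)
{
	size_t n = 0;

	for (; head; head = head->next) {
		if (head->start == INVALID_OFFSET)
			continue;     /* created but not mapped yet */
		n++;
	}
	return n;
}
```

An entry that was created but not yet mapped is simply passed over, so the
recovery path never dereferences an unset offset.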

Signed-off-by: Wentao Lou 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdgpu: amdgpu_device_recover_vram always failed if only one node in shadow_list

2019-05-10T16:36:09+00:00

[ Upstream commit 1712fb1a2f6829150032ac76eb0e39b82a549cfb ]

amdgpu_bo_restore_shadow assigns zero to r on success.
r remains zero if there is only one node in shadow_list.
The current code always returns failure when r <= 0.
Restarting the timeout for each wait was a rather problematic bug as well.
The value of tmo should be updated after each wait; otherwise we wait up to tmo jiffies on every loop iteration.
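
The timeout handling can be sketched in user space as follows. This is an
illustration under assumptions, not the driver code: fake_wait and wait_all
are hypothetical names, and fake_wait only mimics the convention of returning
the ticks left on success and 0 on timeout.

```c
/*
 * Hypothetical stand-in for a fence wait. It returns the ticks left (> 0)
 * on success and 0 on timeout.
 */
static long fake_wait(long cost, long timeout)
{
	return cost < timeout ? timeout - cost : 0;
}

/* Wait for n restores against one shared budget; 0 on success, -1 on timeout. */
static int wait_all(const long *costs, int n, long budget)
{
	long tmo = budget;
	int i;

	for (i = 0; i < n; i++) {
		tmo = fake_wait(costs[i], tmo);  /* carry the remainder forward */
		if (tmo == 0)
			return -1;
	}
	return 0;
}
```

Because tmo carries the remaining budget into the next wait, the total wait is
bounded by the initial budget rather than by the budget times the number of
nodes, and a list with a single node succeeds as long as its wait does.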

Signed-off-by: Wentao Lou 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdgpu/sriov:Correct pfvf exchange logic

2019-01-02T20:24:48+00:00

The pfvf exchange needs to be done in exclusive mode. Also add the pfvf
exchange to gpu reset.

Signed-off-by: Emily Deng 
Reviewed-By: Xiangliang Yu 
Signed-off-by: Alex Deucher

drm/amdgpu/virtual_dce: No need to pin the cursor bo

2019-01-02T20:24:45+00:00

For the virtual display feature, there is no need to pin the cursor bo.

Signed-off-by: Emily Deng 
Reviewed-by: Huang Rui 
Signed-off-by: Alex Deucher

drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang

2018-12-14T17:04:38+00:00

The XGMI hive change put kfd_pre_reset into amdgpu_device_lock_adev,
but outside req_full_gpu for sriov.
This makes sriov hang during reset.

Signed-off-by: Wentao Lou 
Reviewed-by: Shaoyun Liu 
Signed-off-by: Alex Deucher

drm/amdgpu: Enable GPU recovery by default for CI

2018-12-12T19:26:40+00:00

I retested Bonaire (gfx7 dGPU) and it works fine.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu/si: fix SI after doorbell rework

2018-12-05T22:49:50+00:00

SI does not use doorbells, so move the asic doorbell init to after the
asic check.

Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=108920
Reviewed-by: Oak Zeng 
Signed-off-by: Alex Deucher

drm/amdgpu: Implement concurrent asic reset for XGMI.

2018-12-03T16:15:14+00:00

Use per hive wq to concurrently send reset commands to all nodes
in the hive.

v2:
Switch to system_highpri_wq after dropping dedicated queue.
Fix non XGMI code path KASAN error.
Stop the per-node hive reset loop if there
is a reset failure on any of the nodes.

Signed-off-by: Andrey Grodzovsky 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: Handle xgmi device removal.

2018-12-03T16:15:08+00:00

The XGMI hive has some resources allocated on device init which
need to be deallocated when the device is unregistered.

v2: Remove creation of dedicated wq for XGMI hive reset.
v3: Use the gmc.xgmi.supported flag

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: Fix num_doorbell calculation issue

2018-11-30T17:01:04+00:00

When the paging queue is enabled, it uses the second doorbell page.
The AMDGPU_DOORBELL64_MAX_ASSIGNMENT definition assumes all the
kernel doorbells are in the first page. So with the paging queue enabled,
the total kernel doorbell range should be the original num_doorbell plus
one page (0x400 dwords), not num_doorbell * 2.
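
The arithmetic can be sketched as below. The function and macro names are
hypothetical and chosen for illustration only; the 0x400 dword page size is
the one stated above.

```c
#include <stdint.h>

/* One doorbell page, in dwords (value taken from the description above). */
#define DOORBELL_PAGE_DWORDS 0x400u

/* Hypothetical helper: kernel doorbell range for a given base count. */
static uint32_t kernel_doorbell_count(uint32_t num_doorbells,
				      int paging_queue_enabled)
{
	if (paging_queue_enabled)
		num_doorbells += DOORBELL_PAGE_DWORDS;  /* add a page, do not double */
	return num_doorbells;
}
```

For a base count of 0x100, for example, this gives 0x500 with the paging queue
enabled, whereas doubling would give 0x200 and fall short of the second page.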

Signed-off-by: Oak Zeng 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher