linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c, branch v6.12

drm/amdgpu: Retire query_utcl2_poison_status callback

2024-08-23T14:53:16+00:00

Driver switches to interrupt source id to identify
utcl2 poison event. polling interface is not needed.

Signed-off-by: Hawking Zhang 
Reviewed-by: Tao Zhou 
Signed-off-by: Alex Deucher

drm/amdkfd: APIs to stop/start KFD scheduling

2024-08-21T02:07:28+00:00

Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from
runlist. If users send ioctls to KFD to create queues, they'll be added
but those queues won't be mapped to runlist (so not scheduled) until
amdgpu_amdkfd_start_sched is called.

v2: fix build (Alex)

Signed-off-by: Amber Lin 
Signed-off-by: Alex Deucher

drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

2024-07-23T21:34:44+00:00

Pass pointer reference to amdgpu_bo_unref to clear the correct pointer,
otherwise amdgpu_bo_unref clear the local variable, the original pointer
not set to NULL, this could cause use-after-free bug.

Signed-off-by: Philip Yang 
Reviewed-by: Felix Kuehling 
Acked-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdkfd: add reset cause in gpu pre-reset smi event

2024-06-05T15:25:14+00:00

reset cause is requested by customer as additional
info for gpu reset smi event.

v2: integerate reset sources suggested by Lijo Lazar

Signed-off-by: Eric Huang 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher

drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs

2024-05-08T19:17:04+00:00

Small APUs(i.e., consumer, embedded products) usually have a small
carveout device memory which can't satisfy most compute workloads
memory allocation requirements.

We can't even run a Basic MNIST Example with a default 512MB carveout.
https://github.com/pytorch/examples/tree/main/mnist. Error Log:

"torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate
84.00 MiB. GPU 0 has a total capacity of 512.00 MiB of which 0 bytes
is free. Of the allocated memory 103.83 MiB is allocated by PyTorch,
and 22.17 MiB is reserved by PyTorch but unallocated"

Though we can change BIOS settings to enlarge carveout size,
which is inflexible and may bring complaint. On the other hand,
the memory resource can't be effectively used between host and device.

The solution is MI300A approach, i.e., let VRAM allocations go to GTT.
Then device and host can flexibly and effectively share memory resource.

v2: Report local_mem_size_private as 0. (Felix)

Signed-off-by: Lang Yu 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdgpu: prepare to handle pasid poison consumption

2024-04-26T21:22:42+00:00

Prepare to handle pasid poison consumption.

Signed-off-by: YiPeng Chai 
Reviewed-by: Tao Zhou 
Signed-off-by: Alex Deucher

drm/amdgpu: make reset method configurable for RAS poison

2024-03-20T17:38:15+00:00

Each RAS block has different requirement for gpu reset in poison
consumption handling.
Add support for mmhub RAS poison consumption handling.

v2: remove the mmhub poison support for kfd int v10.

Signed-off-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: support utcl2 RAS poison query for mmhub

2024-03-20T17:38:14+00:00

Support the query for both gfxhub and mmhub, also replace
xcc_id with hub_inst.

Signed-off-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: retire gfx ras query_utcl2_poison_status

2024-03-20T17:37:36+00:00

Replace it with related interface in gfxhub functions.

v2: replace node id with xcc id.
    get node id for query_utcl2_poison_status

Signed-off-by: Tao Zhou 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-20T17:12:57+00:00

In passthrough environment, when amdgpu is reloaded after unload, mode-1
is triggered after initializing the necessary IPs, That init does not
include KFD, and KFD init waits until the reset is completed. KFD init
is called in the reset handler, but in this case, the zone device and
drm client is not initialized, causing app to create kernel panic.

v2: Removing the init KFD condition from amdgpu_amdkfd_drm_client_create.
As the previous version has the potential of creating DRM client twice.

v3: v2 patch results in SDMA engine hung as DRM open causes VM clear to SDMA
before SDMA init. Adding the condition to in drm client creation, on top of v1,
to guard against drm client creation call multiple times.

Signed-off-by: Ahmad Rehman 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher