linux.git/drivers/gpu/drm/amd/amdkfd, branch v6.9

drm/amdkfd: don't allow mapping the MMIO HDP page with large pages

2024-05-10T17:05:13+00:00

We don't get the right offset in that case.  The GPU has
an unused 4K area of the register BAR space into which you can
remap registers.  We remap the HDP flush registers into this
space to allow userspace (CPU or GPU) to flush the HDP when it
updates VRAM.  However, on systems with >4K pages, we end up
exposing PAGE_SIZE of MMIO space.

Fixes: d8e408a82704 ("drm/amdkfd: Expose HDP registers to user space")
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org
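
The restriction described in the commit above can be illustrated with
a minimal user-space sketch. The helper name mmio_remap_check, the
MMIO_REMAP_WINDOW_SIZE constant, and the -ENOMEM error code are
assumptions made for illustration and are not taken from the driver;
cpu_page_size stands in for the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <errno.h>

/* Size of the unused register BAR area that holds the remapped HDP
 * flush registers (4K, per the commit message). */
#define MMIO_REMAP_WINDOW_SIZE 4096UL

/*
 * Hypothetical helper: decide whether the MMIO remap page may be
 * exposed to userspace. Returns 0 if the mapping is allowed and
 * -ENOMEM otherwise (the error code is an assumption).
 */
static int mmio_remap_check(unsigned long cpu_page_size,
                            unsigned long remap_offset)
{
    /* No remap offset available: nothing to map. */
    if (remap_offset == 0)
        return -ENOMEM;

    /* A CPU page larger than the 4K window would expose
     * neighbouring MMIO registers, so refuse the mapping. */
    if (cpu_page_size > MMIO_REMAP_WINDOW_SIZE)
        return -ENOMEM;

    return 0;
}
```

With a 4K page size the check passes; with a 64K page size it fails,
because one CPU page would cover sixteen times the remapped area.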

Revert "drm/amdkfd: Add partition id field to location_id"

2024-05-08T19:51:18+00:00

This reverts commit c37ce764cd492f044dcdbb39616298f02b0dbc7f.

The RCCL library does not currently treat spatial partitions
differently, so this change is causing issues. Revert it temporarily
until the RCCL implementation is ready for spatial partitions.

Signed-off-by: Lijo Lazar 
Reviewed-by: Jonathan Kim 
Signed-off-by: Alex Deucher

drm/amdkfd: Flush the process wq before creating a kfd_process

2024-05-01T01:59:16+00:00

There is a race condition when re-creating a kfd_process for a process.
This has been observed when a process under the debugger executes
exec(3).  In this scenario:
- The process executes exec.
  - This will eventually release the process's mm, which will cause
    the kfd_process object associated with the process to be freed
    (kfd_process_free_notifier decrements the reference count of the
    kfd_process to 0).  This causes kfd_process_ref_release to enqueue
    kfd_process_wq_release to the kfd_process_wq.
- The debugger receives the PTRACE_EVENT_EXEC notification, and tries to
  re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE).
  - When handling this request, KFD tries to re-create a kfd_process.
    This eventually calls kfd_create_process and kobject_init_and_add.

At this point the call to kobject_init_and_add can fail because the
old kfd_process.kobj has not been freed yet by kfd_process_wq_release.

This patch proposes to avoid this race by making sure to drain
kfd_process_wq before creating a new kfd_process object.  This way, we
know that any cleanup task is done executing when we reach
kobject_init_and_add.

Signed-off-by: Lancelot SIX 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher
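
The ordering problem in the commit above can be modelled with a small
single-threaded toy. This is a sketch only: the toy_* names and the
-EEXIST error code are assumptions for illustration, and the real
driver uses a kernel workqueue and kobjects rather than two flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model: one name slot (standing in for the old object's kobject)
 * and one deferred cleanup item (standing in for the queued release). */
struct toy_state {
    bool name_registered;  /* old object's name is still present */
    bool release_pending;  /* cleanup is queued but has not run yet */
};

/* Dropping the last reference only queues the cleanup. */
static void toy_ref_release(struct toy_state *s)
{
    s->release_pending = true;
}

/* Run any queued cleanup, analogous to draining the workqueue. */
static void toy_flush(struct toy_state *s)
{
    if (s->release_pending) {
        s->name_registered = false;
        s->release_pending = false;
    }
}

/* Create a new object; fails if the old name is still registered. */
static int toy_create(struct toy_state *s, bool flush_first)
{
    if (flush_first)
        toy_flush(s);
    if (s->name_registered)
        return -EEXIST;   /* assumed error code for a duplicate name */
    s->name_registered = true;
    return 0;
}
```

Calling toy_create with flush_first set to false right after
toy_ref_release reproduces the failure; draining first lets the
creation succeed.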

drm/amdkfd: Add VRAM accounting for SVM migration

2024-04-24T03:23:45+00:00

Do VRAM accounting when migrating to VRAM to make sure there is
enough VRAM available and that migrating to VRAM doesn't evict other,
possibly non-unified-memory, BOs. If migrating to VRAM fails, the
driver can fall back to using system memory seamlessly.

Signed-off-by: Mukul Joshi 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher
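
A minimal sketch of the accounting idea in the commit above, assuming
a simple byte counter. The vram_pool structure and the vram_reserve,
vram_unreserve, and migrate_range functions are hypothetical names for
illustration, not the driver's API.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical VRAM budget: total capacity and bytes already reserved. */
struct vram_pool {
    unsigned long long total;
    unsigned long long reserved;
};

enum placement { PLACE_VRAM, PLACE_SYSTEM };

/* Reserve size bytes, or fail without changing the pool. */
static int vram_reserve(struct vram_pool *p, unsigned long long size)
{
    if (size > p->total - p->reserved)
        return -ENOMEM;
    p->reserved += size;
    return 0;
}

/* Return a previous reservation to the pool. */
static void vram_unreserve(struct vram_pool *p, unsigned long long size)
{
    p->reserved -= size;
}

/* Account first; if VRAM is short, stay in system memory instead of
 * evicting other buffers. */
static enum placement migrate_range(struct vram_pool *p,
                                    unsigned long long size)
{
    if (vram_reserve(p, size) != 0)
        return PLACE_SYSTEM;
    return PLACE_VRAM;
}
```

The caller is expected to call vram_unreserve when the range later
leaves VRAM, so that the budget stays balanced.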

drm/amdkfd: Fix rescheduling of restore worker

2024-04-24T03:23:28+00:00

Handle the case that the restore worker was already scheduled by another
eviction while the restore was in progress.

Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs")
Signed-off-by: Felix Kuehling 
Reviewed-by: Philip Yang 
Tested-by: Yunxiang Li 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org

drm/amdkfd: Fix eviction fence handling

2024-04-24T03:17:30+00:00

Handle the case where dma_fence_get_rcu_safe returns NULL.

If restore work is already scheduled, only update its timer. The same
work item cannot be queued twice, so undo the extra queue eviction.

Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs")
Signed-off-by: Felix Kuehling 
Reviewed-by: Philip Yang 
Tested-by: Gang BA 
Reviewed-by: Gang BA 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org

drm/amdkfd: Fix memory leak in create_process failure

2024-04-17T15:05:09+00:00

Fix memory leak due to a leaked mmget reference on an error handling
code path that is triggered when attempting to create KFD processes
while a GPU reset is in progress.

Fixes: 0ab2d7532b05 ("drm/amdkfd: prepare per-process debug enable and disable")
CC: Xiaogang Chen 
Signed-off-by: Felix Kuehling 
Tested-by: Harish Kasiviswanathan
Reviewed-by: Mukul Joshi 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org

amdkfd: use calloc instead of kzalloc to avoid integer overflow

2024-04-12T01:11:59+00:00

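Use a calloc-style allocation, which takes the element count and the
element size separately so the allocator can check the product,
instead of open-coding the multiplication, which might overflow.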

Cc: stable@vger.kernel.org
Signed-off-by: Dave Airlie
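
The overflow that the commit above avoids can be shown in user space.
This is a sketch under the assumption of a plain C library:
checked_array_zalloc is a hypothetical helper, not a kernel function,
and it makes the count-times-size check explicit before delegating to
calloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate a zeroed array of n elements of the
 * given size, returning NULL if n * size would not fit in a size_t.
 */
static void *checked_array_zalloc(size_t n, size_t size)
{
    /* Reject the request before the product can wrap around. */
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;
    return calloc(n, size);
}
```

With an open-coded multiplication, a caller-controlled count such as
SIZE_MAX / 8 + 1 multiplied by a 16-byte element size wraps to 0, so
an allocation of the wrong size would appear to succeed. The checked
helper rejects that request instead.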

amd/amdkfd: sync all devices to wait all processes being evicted

2024-04-10T03:28:30+00:00

If more than one device is being reset in parallel, the first device
calls kfd_suspend_all_processes() to evict all processes on all
devices, and this call takes time to finish. The other devices start
reset and recovery without waiting. If a process has not been evicted
before recovery starts, it is restored, which then causes a page
fault.

Signed-off-by: Zhigang Luo 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Reset GPU on queue preemption failure

2024-04-10T03:09:31+00:00

Currently, with F32 HWS, a GPU reset is only triggered when unmapping
queues fails.

However, if a compute queue doesn't respond to a preemption request in
time, unmap will return without any error. In this case, only the
preemption error is logged and a reset is not triggered. Trigger a GPU
reset in this case as well.

Reviewed-by: Alex Deucher 
Signed-off-by: Harish Kasiviswanathan 
Reviewed-by: Mukul Joshi 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org
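
The decision described in the commit above can be reduced to a small
predicate. This is a sketch only: the unmap_result structure and the
reset_needed_old and reset_needed_new functions are hypothetical names
for illustration and do not mirror the driver's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of an unmap-queues request. */
struct unmap_result {
    int unmap_err;           /* non-zero if the unmap itself failed */
    bool preempt_timed_out;  /* queue did not respond to preemption in time */
};

/* Previous behaviour: reset only when the unmap reports an error. */
static bool reset_needed_old(const struct unmap_result *r)
{
    return r->unmap_err != 0;
}

/* Behaviour after the change: a preemption timeout also triggers a
 * reset, even though the unmap returned without an error. */
static bool reset_needed_new(const struct unmap_result *r)
{
    return r->unmap_err != 0 || r->preempt_timed_out;
}
```

The two predicates differ only in the case where the unmap succeeds
but the preemption request timed out.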