linux.git/drivers/gpu/drm/amd/amdkfd, branch v6.16

drm/amdkfd: Don't call mmput from MMU notifier callback

2025-06-30T17:57:12+00:00

If the process is exiting, the mmput inside mmu notifier callback from
compactd or fork or numa balancing could release the last reference
of mm struct to call exit_mmap and free_pgtable, this triggers deadlock
with below backtrace.

The deadlock will leak kfd process as mmu notifier release is not called
and cause VRAM leaking.

The fix is to take mm reference mmget_non_zero when adding prange to the
deferred list to pair with mmput in deferred list work.

If prange split and add into pchild list, the pchild work_item.mm is not
used, so remove the mm parameter from svm_range_unmap_split and
svm_range_add_child.

The backtrace of hung task:

 INFO: task python:348105 blocked for more than 64512 seconds.
 Call Trace:
  __schedule+0x1c3/0x550
  schedule+0x46/0xb0
  rwsem_down_write_slowpath+0x24b/0x4c0
  unlink_anon_vmas+0xb1/0x1c0
  free_pgtables+0xa9/0x130
  exit_mmap+0xbc/0x1a0
  mmput+0x5a/0x140
  svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
  mn_itree_invalidate+0x72/0xc0
  __mmu_notifier_invalidate_range_start+0x48/0x60
  try_to_unmap_one+0x10fa/0x1400
  rmap_walk_anon+0x196/0x460
  try_to_unmap+0xbb/0x210
  migrate_page_unmap+0x54d/0x7e0
  migrate_pages_batch+0x1c3/0xae0
  migrate_pages_sync+0x98/0x240
  migrate_pages+0x25c/0x520
  compact_zone+0x29d/0x590
  compact_zone_order+0xb6/0xf0
  try_to_compact_pages+0xbe/0x220
  __alloc_pages_direct_compact+0x96/0x1a0
  __alloc_pages_slowpath+0x410/0x930
  __alloc_pages_nodemask+0x3a9/0x3e0
  do_huge_pmd_anonymous_page+0xd7/0x3e0
  __handle_mm_fault+0x5e3/0x5f0
  handle_mm_fault+0xf7/0x2e0
  hmm_vma_fault.isra.0+0x4d/0xa0
  walk_pmd_range.isra.0+0xa8/0x310
  walk_pud_range+0x167/0x240
  walk_pgd_range+0x55/0x100
  __walk_page_range+0x87/0x90
  walk_page_range+0xf6/0x160
  hmm_range_fault+0x4f/0x90
  amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
  amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
  init_user_pages+0xb1/0x2a0 [amdgpu]
  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
  kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
  kfd_ioctl+0x29d/0x500 [amdgpu]

Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
Signed-off-by: Philip Yang 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
(cherry picked from commit a29e067bd38946f752b0ef855f3dfff87e77bec7)
Cc: stable@vger.kernel.org

amdkfd: MTYPE_UC for ext-coherent system memory

2025-06-30T17:51:46+00:00

Set memory mtype to UC host memory when ext-coherent
flag is set and memory is registered as a SVM allocation.

Reviewed-by: Amber Lin 
Signed-off-by: David Yat Sin 
Signed-off-by: Alex Deucher 
(cherry picked from commit 5d14fdab4778c29cfd39e62c3ce84d232b4a7d8c)

drm/amdkfd: Fix race in GWS queue scheduling

2025-06-18T17:17:32+00:00

q->gws is not updated atomically with qpd->mapped_gws_queue. If a
runlist is created between pqm_set_gws and update_queue it will
contain a queue which uses GWS in a process with no GWS allocated.
This will result in a scheduler hang.

Use q->properties.is_gws which is changed while holding the DQM lock.

Signed-off-by: Jay Cornwall 
Reviewed-by: Harish Kasiviswanathan 
Signed-off-by: Alex Deucher 
(cherry picked from commit b98370220eb3110e82248e3354e16a489a492cfb)
Cc: stable@vger.kernel.org

drm/amdkfd: move SDMA queue reset capability check to node_show

2025-06-18T16:56:52+00:00

Relocate the per-SDMA queue reset capability check from
kfd_topology_set_capabilities() to node_show() to ensure we read the
latest value of sdma.supported_reset after all IP blocks are initialized.

Fixes: ceb7114c961b ("drm/amdkfd: flag per-sdma queue reset supported to user space")
Reviewed-by: Jonathan Kim 
Signed-off-by: Jesse Zhang 
Signed-off-by: Alex Deucher 
(cherry picked from commit e17df7b086cf908cedd919f448da9e00028419bb)

drm/amdkfd: enable kfd on RISCV systems

2025-06-03T19:02:48+00:00

KFD has been confirmed that can run on RISCV systems. It's necessary to
support CONFIG_HSA_AMD on RISCV.

Signed-off-by: Xuemei Liu 
Signed-off-by: Felix Kuehling 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Map wptr BO to GART unconditionally

2025-05-29T14:58:44+00:00

For simulation C models that don't run CP FW where adev->mes.sched_version
is not populated correctly. This causes NULL dereference in
amdgpu_amdkfd_free_gtt_mem(dev->adev, (void **)&pqn->q->wptr_bo_gart)
and warning on unpinned BO in amdgpu_bo_gpu_offset(q->properties.wptr_bo).

Compared with adding version check here and there,
always map wptr BO to GART simplifies things.

v2: Add NULL check in amdgpu_amdkfd_free_gtt_mem.(Philip)

Signed-off-by: Lang Yu 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

amd/amdkfd: fix a kfd_process ref leak

2025-05-29T14:57:02+00:00

This patch is to fix a kfd_prcess ref leak.

Signed-off-by: Yifan Zhang 
Reviewed-by: Philip Yang 
Signed-off-by: Alex Deucher

drm/amdkfd: Identical code for different branches

2025-05-29T14:56:54+00:00

This patch removes the if/else statement in the
cik_event_interrupt_wq function because it is redundant
with both branches resulting in identical outcomes,
this improves code readibility.

Signed-off-by: Sunday Clement 
Reviewed-by: Harish Kasiviswanathan 
Signed-off-by: Alex Deucher

drm/amdkfd: Change svm_range_get_info return type

2025-05-22T16:00:30+00:00

Static analysis shows that pointer "svms" cannot be NULL because it points
to the object "struct svm_range_list". Remove the extra NULL check. It is
meaningless and harms the readability of the code.

In the function svm_range_get_info() there is no possibility of failure.
Therefore, the caller of the function svm_range_get_info() does not need
a return value. Change the function svm_range_get_info() return type from
"int" to "void".

Since the function svm_range_get_info() has a return type of "void". The
caller of the function svm_range_get_info() does not need a return value.
Delete extra code.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Andrey Vatoropin 
Signed-off-by: Felix Kuehling 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Support chain runlists of XNACK+/XNACK-

2025-05-16T17:37:29+00:00

If the MEC firmware supports chaining runlists of XNACK+/XNACK-
processes, set SQ_CONFIG1 chicken bit and SET_RESOURCES bit 28.

When the MEC/HWS supports it, KFD checks the XNACK+/XNACK- processes mix
happens or not. If it does, enter over-subscription.

Signed-off-by: Amber Lin 
Reviewed-by: Philip Yang 
Signed-off-by: Alex Deucher