<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/drivers/gpu/drm/amd/amdkfd, branch v6.16</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>drm/amdkfd: Don't call mmput from MMU notifier callback</title>
<updated>2025-06-30T17:57:12+00:00</updated>
<author>
<name>Philip Yang</name>
<email>Philip.Yang@amd.com</email>
</author>
<published>2025-06-20T22:32:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cf234231fcbc7d391e2135b9518613218cc5347f'/>
<id>cf234231fcbc7d391e2135b9518613218cc5347f</id>
<content type='text'>
If the process is exiting, the mmput inside mmu notifier callback from
compactd or fork or numa balancing could release the last reference
of mm struct to call exit_mmap and free_pgtable, this triggers deadlock
with below backtrace.

The deadlock will leak kfd process as mmu notifier release is not called
and cause VRAM leaking.

The fix is to take mm reference mmget_non_zero when adding prange to the
deferred list to pair with mmput in deferred list work.

If prange split and add into pchild list, the pchild work_item.mm is not
used, so remove the mm parameter from svm_range_unmap_split and
svm_range_add_child.

The backtrace of hung task:

 INFO: task python:348105 blocked for more than 64512 seconds.
 Call Trace:
  __schedule+0x1c3/0x550
  schedule+0x46/0xb0
  rwsem_down_write_slowpath+0x24b/0x4c0
  unlink_anon_vmas+0xb1/0x1c0
  free_pgtables+0xa9/0x130
  exit_mmap+0xbc/0x1a0
  mmput+0x5a/0x140
  svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
  mn_itree_invalidate+0x72/0xc0
  __mmu_notifier_invalidate_range_start+0x48/0x60
  try_to_unmap_one+0x10fa/0x1400
  rmap_walk_anon+0x196/0x460
  try_to_unmap+0xbb/0x210
  migrate_page_unmap+0x54d/0x7e0
  migrate_pages_batch+0x1c3/0xae0
  migrate_pages_sync+0x98/0x240
  migrate_pages+0x25c/0x520
  compact_zone+0x29d/0x590
  compact_zone_order+0xb6/0xf0
  try_to_compact_pages+0xbe/0x220
  __alloc_pages_direct_compact+0x96/0x1a0
  __alloc_pages_slowpath+0x410/0x930
  __alloc_pages_nodemask+0x3a9/0x3e0
  do_huge_pmd_anonymous_page+0xd7/0x3e0
  __handle_mm_fault+0x5e3/0x5f0
  handle_mm_fault+0xf7/0x2e0
  hmm_vma_fault.isra.0+0x4d/0xa0
  walk_pmd_range.isra.0+0xa8/0x310
  walk_pud_range+0x167/0x240
  walk_pgd_range+0x55/0x100
  __walk_page_range+0x87/0x90
  walk_page_range+0xf6/0x160
  hmm_range_fault+0x4f/0x90
  amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
  amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
  init_user_pages+0xb1/0x2a0 [amdgpu]
  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
  kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
  kfd_ioctl+0x29d/0x500 [amdgpu]

Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
Signed-off-by: Philip Yang &lt;Philip.Yang@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit a29e067bd38946f752b0ef855f3dfff87e77bec7)
Cc: stable@vger.kernel.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If the process is exiting, the mmput inside mmu notifier callback from
compactd or fork or numa balancing could release the last reference
of mm struct to call exit_mmap and free_pgtable, this triggers deadlock
with below backtrace.

The deadlock will leak kfd process as mmu notifier release is not called
and cause VRAM leaking.

The fix is to take mm reference mmget_non_zero when adding prange to the
deferred list to pair with mmput in deferred list work.

If prange split and add into pchild list, the pchild work_item.mm is not
used, so remove the mm parameter from svm_range_unmap_split and
svm_range_add_child.

The backtrace of hung task:

 INFO: task python:348105 blocked for more than 64512 seconds.
 Call Trace:
  __schedule+0x1c3/0x550
  schedule+0x46/0xb0
  rwsem_down_write_slowpath+0x24b/0x4c0
  unlink_anon_vmas+0xb1/0x1c0
  free_pgtables+0xa9/0x130
  exit_mmap+0xbc/0x1a0
  mmput+0x5a/0x140
  svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
  mn_itree_invalidate+0x72/0xc0
  __mmu_notifier_invalidate_range_start+0x48/0x60
  try_to_unmap_one+0x10fa/0x1400
  rmap_walk_anon+0x196/0x460
  try_to_unmap+0xbb/0x210
  migrate_page_unmap+0x54d/0x7e0
  migrate_pages_batch+0x1c3/0xae0
  migrate_pages_sync+0x98/0x240
  migrate_pages+0x25c/0x520
  compact_zone+0x29d/0x590
  compact_zone_order+0xb6/0xf0
  try_to_compact_pages+0xbe/0x220
  __alloc_pages_direct_compact+0x96/0x1a0
  __alloc_pages_slowpath+0x410/0x930
  __alloc_pages_nodemask+0x3a9/0x3e0
  do_huge_pmd_anonymous_page+0xd7/0x3e0
  __handle_mm_fault+0x5e3/0x5f0
  handle_mm_fault+0xf7/0x2e0
  hmm_vma_fault.isra.0+0x4d/0xa0
  walk_pmd_range.isra.0+0xa8/0x310
  walk_pud_range+0x167/0x240
  walk_pgd_range+0x55/0x100
  __walk_page_range+0x87/0x90
  walk_page_range+0xf6/0x160
  hmm_range_fault+0x4f/0x90
  amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
  amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
  init_user_pages+0xb1/0x2a0 [amdgpu]
  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
  kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
  kfd_ioctl+0x29d/0x500 [amdgpu]

Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
Signed-off-by: Philip Yang &lt;Philip.Yang@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit a29e067bd38946f752b0ef855f3dfff87e77bec7)
Cc: stable@vger.kernel.org
</pre>
</div>
</content>
</entry>
<entry>
<title>amdkfd: MTYPE_UC for ext-coherent system memory</title>
<updated>2025-06-30T17:51:46+00:00</updated>
<author>
<name>David Yat Sin</name>
<email>David.YatSin@amd.com</email>
</author>
<published>2025-06-19T17:51:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=62461367f4c0dcf4fab9cafb4ab3a7d346788df6'/>
<id>62461367f4c0dcf4fab9cafb4ab3a7d346788df6</id>
<content type='text'>
Set memory mtype to UC host memory when ext-coherent
flag is set and memory is registered as a SVM allocation.

Reviewed-by: Amber Lin &lt;Amber.Lin@amd.com&gt;
Signed-off-by: David Yat Sin &lt;David.YatSin@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit 5d14fdab4778c29cfd39e62c3ce84d232b4a7d8c)
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Set memory mtype to UC host memory when ext-coherent
flag is set and memory is registered as a SVM allocation.

Reviewed-by: Amber Lin &lt;Amber.Lin@amd.com&gt;
Signed-off-by: David Yat Sin &lt;David.YatSin@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit 5d14fdab4778c29cfd39e62c3ce84d232b4a7d8c)
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdkfd: Fix race in GWS queue scheduling</title>
<updated>2025-06-18T17:17:32+00:00</updated>
<author>
<name>Jay Cornwall</name>
<email>jay.cornwall@amd.com</email>
</author>
<published>2025-06-11T14:52:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cfb05257ae168a0496c7637e1d9e3ab8a25cbffe'/>
<id>cfb05257ae168a0496c7637e1d9e3ab8a25cbffe</id>
<content type='text'>
q-&gt;gws is not updated atomically with qpd-&gt;mapped_gws_queue. If a
runlist is created between pqm_set_gws and update_queue it will
contain a queue which uses GWS in a process with no GWS allocated.
This will result in a scheduler hang.

Use q-&gt;properties.is_gws which is changed while holding the DQM lock.

Signed-off-by: Jay Cornwall &lt;jay.cornwall@amd.com&gt;
Reviewed-by: Harish Kasiviswanathan &lt;Harish.Kasiviswanathan@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit b98370220eb3110e82248e3354e16a489a492cfb)
Cc: stable@vger.kernel.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
q-&gt;gws is not updated atomically with qpd-&gt;mapped_gws_queue. If a
runlist is created between pqm_set_gws and update_queue it will
contain a queue which uses GWS in a process with no GWS allocated.
This will result in a scheduler hang.

Use q-&gt;properties.is_gws which is changed while holding the DQM lock.

Signed-off-by: Jay Cornwall &lt;jay.cornwall@amd.com&gt;
Reviewed-by: Harish Kasiviswanathan &lt;Harish.Kasiviswanathan@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit b98370220eb3110e82248e3354e16a489a492cfb)
Cc: stable@vger.kernel.org
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdkfd: move SDMA queue reset capability check to node_show</title>
<updated>2025-06-18T16:56:52+00:00</updated>
<author>
<name>Jesse.Zhang</name>
<email>Jesse.Zhang@amd.com</email>
</author>
<published>2025-05-29T03:27:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a55737dab6ba63eb4241e9c6547629058af31e12'/>
<id>a55737dab6ba63eb4241e9c6547629058af31e12</id>
<content type='text'>
Relocate the per-SDMA queue reset capability check from
kfd_topology_set_capabilities() to node_show() to ensure we read the
latest value of sdma.supported_reset after all IP blocks are initialized.

Fixes: ceb7114c961b ("drm/amdkfd: flag per-sdma queue reset supported to user space")
Reviewed-by: Jonathan Kim &lt;jonathan.kim@amd.com&gt;
Signed-off-by: Jesse Zhang &lt;Jesse.Zhang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit e17df7b086cf908cedd919f448da9e00028419bb)
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Relocate the per-SDMA queue reset capability check from
kfd_topology_set_capabilities() to node_show() to ensure we read the
latest value of sdma.supported_reset after all IP blocks are initialized.

Fixes: ceb7114c961b ("drm/amdkfd: flag per-sdma queue reset supported to user space")
Reviewed-by: Jonathan Kim &lt;jonathan.kim@amd.com&gt;
Signed-off-by: Jesse Zhang &lt;Jesse.Zhang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
(cherry picked from commit e17df7b086cf908cedd919f448da9e00028419bb)
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdkfd: enable kfd on RISCV systems</title>
<updated>2025-06-03T19:02:48+00:00</updated>
<author>
<name>Xuemei Liu</name>
<email>liu.xuemei1@zte.com.cn</email>
</author>
<published>2025-05-29T02:25:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4e696906e9a82d4cab75f3083fabd65433c77e20'/>
<id>4e696906e9a82d4cab75f3083fabd65433c77e20</id>
<content type='text'>
KFD has been confirmed that can run on RISCV systems. It's necessary to
support CONFIG_HSA_AMD on RISCV.

Signed-off-by: Xuemei Liu &lt;liu.xuemei1@zte.com.cn&gt;
Signed-off-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
KFD has been confirmed that can run on RISCV systems. It's necessary to
support CONFIG_HSA_AMD on RISCV.

Signed-off-by: Xuemei Liu &lt;liu.xuemei1@zte.com.cn&gt;
Signed-off-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdkfd: Map wptr BO to GART unconditionally</title>
<updated>2025-05-29T14:58:44+00:00</updated>
<author>
<name>Lang Yu</name>
<email>lang.yu@amd.com</email>
</author>
<published>2025-05-23T02:04:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=30837a49bd0aba0f311d4056cd48753955f60d40'/>
<id>30837a49bd0aba0f311d4056cd48753955f60d40</id>
<content type='text'>
For simulation C models that don't run CP FW where adev-&gt;mes.sched_version
is not populated correctly. This causes NULL dereference in
amdgpu_amdkfd_free_gtt_mem(dev-&gt;adev, (void **)&amp;pqn-&gt;q-&gt;wptr_bo_gart)
and warning on unpinned BO in amdgpu_bo_gpu_offset(q-&gt;properties.wptr_bo).

Compared with adding version check here and there,
always map wptr BO to GART simplifies things.

v2: Add NULL check in amdgpu_amdkfd_free_gtt_mem.(Philip)

Signed-off-by: Lang Yu &lt;lang.yu@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;Felix.Kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
For simulation C models that don't run CP FW where adev-&gt;mes.sched_version
is not populated correctly. This causes NULL dereference in
amdgpu_amdkfd_free_gtt_mem(dev-&gt;adev, (void **)&amp;pqn-&gt;q-&gt;wptr_bo_gart)
and warning on unpinned BO in amdgpu_bo_gpu_offset(q-&gt;properties.wptr_bo).

Compared with adding version check here and there,
always map wptr BO to GART simplifies things.

v2: Add NULL check in amdgpu_amdkfd_free_gtt_mem.(Philip)

Signed-off-by: Lang Yu &lt;lang.yu@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;Felix.Kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>amd/amdkfd: fix a kfd_process ref leak</title>
<updated>2025-05-29T14:57:02+00:00</updated>
<author>
<name>Yifan Zhang</name>
<email>yifan1.zhang@amd.com</email>
</author>
<published>2025-05-21T10:06:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=90237b16ec1d7afa16e2173cc9a664377214cdd9'/>
<id>90237b16ec1d7afa16e2173cc9a664377214cdd9</id>
<content type='text'>
This patch is to fix a kfd_prcess ref leak.

Signed-off-by: Yifan Zhang &lt;yifan1.zhang@amd.com&gt;
Reviewed-by: Philip Yang &lt;Philip.Yang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch is to fix a kfd_prcess ref leak.

Signed-off-by: Yifan Zhang &lt;yifan1.zhang@amd.com&gt;
Reviewed-by: Philip Yang &lt;Philip.Yang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdkfd: Identical code for different branches</title>
<updated>2025-05-29T14:56:54+00:00</updated>
<author>
<name>Sunday Clement</name>
<email>Sunday.Clement@amd.com</email>
</author>
<published>2025-05-23T21:49:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=1091fba163834f51a02d5d149bd657804e6ab749'/>
<id>1091fba163834f51a02d5d149bd657804e6ab749</id>
<content type='text'>
This patch removes the if/else statement in the
cik_event_interrupt_wq function because it is redundant
with both branches resulting in identical outcomes,
this improves code readibility.

Signed-off-by: Sunday Clement &lt;Sunday.Clement@amd.com&gt;
Reviewed-by: Harish Kasiviswanathan &lt;Harish.Kasiviswanathan@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch removes the if/else statement in the
cik_event_interrupt_wq function because it is redundant
with both branches resulting in identical outcomes,
this improves code readibility.

Signed-off-by: Sunday Clement &lt;Sunday.Clement@amd.com&gt;
Reviewed-by: Harish Kasiviswanathan &lt;Harish.Kasiviswanathan@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdkfd: Change svm_range_get_info return type</title>
<updated>2025-05-22T16:00:30+00:00</updated>
<author>
<name>Andrey Vatoropin</name>
<email>a.vatoropin@crpt.ru</email>
</author>
<published>2025-04-02T14:12:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=eed6a6b2264078806163424b7cb70a3b47cb8e97'/>
<id>eed6a6b2264078806163424b7cb70a3b47cb8e97</id>
<content type='text'>
Static analysis shows that pointer "svms" cannot be NULL because it points
to the object "struct svm_range_list". Remove the extra NULL check. It is
meaningless and harms the readability of the code.

In the function svm_range_get_info() there is no possibility of failure.
Therefore, the caller of the function svm_range_get_info() does not need
a return value. Change the function svm_range_get_info() return type from
"int" to "void".

Since the function svm_range_get_info() has a return type of "void". The
caller of the function svm_range_get_info() does not need a return value.
Delete extra code.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Andrey Vatoropin &lt;a.vatoropin@crpt.ru&gt;
Signed-off-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Static analysis shows that pointer "svms" cannot be NULL because it points
to the object "struct svm_range_list". Remove the extra NULL check. It is
meaningless and harms the readability of the code.

In the function svm_range_get_info() there is no possibility of failure.
Therefore, the caller of the function svm_range_get_info() does not need
a return value. Change the function svm_range_get_info() return type from
"int" to "void".

Since the function svm_range_get_info() has a return type of "void". The
caller of the function svm_range_get_info() does not need a return value.
Delete extra code.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Andrey Vatoropin &lt;a.vatoropin@crpt.ru&gt;
Signed-off-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Reviewed-by: Felix Kuehling &lt;felix.kuehling@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdkfd: Support chain runlists of XNACK+/XNACK-</title>
<updated>2025-05-16T17:37:29+00:00</updated>
<author>
<name>Amber Lin</name>
<email>Amber.Lin@amd.com</email>
</author>
<published>2025-04-29T20:11:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e3d0870a90a8e3f63c486438ac6a77ab828f8d8d'/>
<id>e3d0870a90a8e3f63c486438ac6a77ab828f8d8d</id>
<content type='text'>
If the MEC firmware supports chaining runlists of XNACK+/XNACK-
processes, set SQ_CONFIG1 chicken bit and SET_RESOURCES bit 28.

When the MEC/HWS supports it, KFD checks the XNACK+/XNACK- processes mix
happens or not. If it does, enter over-subscription.

Signed-off-by: Amber Lin &lt;Amber.Lin@amd.com&gt;
Reviewed-by: Philip Yang &lt;Philip.Yang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If the MEC firmware supports chaining runlists of XNACK+/XNACK-
processes, set SQ_CONFIG1 chicken bit and SET_RESOURCES bit 28.

When the MEC/HWS supports it, KFD checks the XNACK+/XNACK- processes mix
happens or not. If it does, enter over-subscription.

Signed-off-by: Amber Lin &lt;Amber.Lin@amd.com&gt;
Reviewed-by: Philip Yang &lt;Philip.Yang@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
