linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c, branch linux-6.16.y

drm/amdgpu: fix task hang from failed job submission during process kill

2025-08-28T14:34:38+00:00

commit aa5fc4362fac9351557eb27c745579159a2e4520 upstream.

During process kill, drm_sched_entity_flush() will kill the vm
entities. The following job submissions of this process will fail, and
the resources of these jobs have not been released, nor have the fences
been signalled, causing tasks to hang and timeout.

Fix by check entity status in amdgpu_vm_ready() and avoid submit jobs to
stopped entity.

v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in
function amdgpu_cs_vm_handling().

Fixes: 1f02f2044bda ("drm/amdgpu: Avoid extra evict-restore process.")
Signed-off-by: Liu01 Tong 
Signed-off-by: Lin.Cao 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 
(cherry picked from commit f101c13a8720c73e67f8f9d511fbbeda95bcedb1)
Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: Avoid extra evict-restore process.

2025-08-28T14:34:30+00:00

commit 1f02f2044bda1db1fd995bc35961ab075fa7b5a2 upstream.

If vm belongs to another process, this is fclose after fork,
wait may enable signaling KFD eviction fence and cause parent process queue evicted.

[677852.634569]  amdkfd_fence_enable_signaling+0x56/0x70 [amdgpu]
[677852.634814]  __dma_fence_enable_signaling+0x3e/0xe0
[677852.634820]  dma_fence_wait_timeout+0x3a/0x140
[677852.634825]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[677852.634831]  amdgpu_vm_wait_idle+0x2d/0x60 [amdgpu]
[677852.635026]  amdgpu_flush+0x34/0x50 [amdgpu]
[677852.635208]  filp_flush+0x38/0x90
[677852.635213]  filp_close+0x14/0x30
[677852.635216]  do_close_on_exec+0xdd/0x130
[677852.635221]  begin_new_exec+0x1da/0x490
[677852.635225]  load_elf_binary+0x307/0xea0
[677852.635231]  ? srso_alias_return_thunk+0x5/0xfbef5
[677852.635235]  ? ima_bprm_check+0xa2/0xd0
[677852.635240]  search_binary_handler+0xda/0x260
[677852.635245]  exec_binprm+0x58/0x1a0
[677852.635249]  bprm_execve.part.0+0x16f/0x210
[677852.635254]  bprm_execve+0x45/0x80
[677852.635257]  do_execveat_common.isra.0+0x190/0x200

Suggested-by: Christian König 
Signed-off-by: Gang Ba 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

drm/amdgpu: adjust enforce_isolation handling

2025-04-11T20:58:15+00:00

Switch from a bool to an enum and allow more options
for enforce isolation.  There are now 3 modes of operation:
- Disabled (0)
- Enabled (serialization and cleaner shader) (1)
- Enabled in legacy mode (no serialization or cleaner shader) (2)
This provides better flexibility for more use cases.

Acked-by: Srinivasan Shanmugam 
Signed-off-by: Alex Deucher

drm/amdgpu: remove is_mes_queue flag

2025-04-08T20:48:21+00:00

This was leftover from MES bring up when we had MES
user queues in the kernel.  It's no longer used so
remove it.

Acked-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: add cleaner shader trace point

2025-03-21T16:16:34+00:00

Note when the cleaner shader is executed.

Signed-off-by: Christian König 
Acked-by: Srinivasan Shanmugam 
Signed-off-by: Alex Deucher

drm/amdgpu: rework how the cleaner shader is emitted v3

2025-03-21T16:16:34+00:00

Instead of emitting the cleaner shader for every job which has the
enforce_isolation flag set only emit it for the first submission from
every client.

v2: add missing NULL check
v3: fix another NULL pointer deref

Signed-off-by: Christian König 
Acked-by: Srinivasan Shanmugam 
Signed-off-by: Alex Deucher

drm/amdkfd: Fix pasid value leak

2025-02-13T02:05:50+00:00

Curret kfd does not allocate pasid values, instead uses pasid value for each
vm from graphic driver. So should not prevent graphic driver from releasing
pasid values since the values are allocated by graphic driver, not kfd driver
anymore. This patch does not stop graphic driver release pasid values.

Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver")
Signed-off-by: Xiaogang Chen 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdgpu: Unlocked unmap only clear page table leaves

2025-02-13T02:05:49+00:00

SVM migration unmap pages from GPU and then update mapping to GPU to
recover page fault. Currently unmap clears the PDE entry for range
length >= huge page and free PTB bo, update mapping to alloc new PT bo.
There is race bug that the freed entry bo maybe still on the pt_free
list, reused when updating mapping and then freed, leave invalid PDE
entry and cause GPU page fault.

By setting the update to clear only one PDE entry or clear PTB, to
avoid unmap to free PTE bo. This fixes the race bug and improve the
unmap and map to GPU performance. Update mapping to huge page will
still free the PTB bo.

With this change, the vm->pt_freed list and work is not needed. Add
WARN_ON(unlocked) in amdgpu_vm_pt_free_dfs to catch if unmap to free the
PTB.

Signed-off-by: Philip Yang 
Reviewed-by: Christian König 
Signed-off-by: Alex Deucher

Merge tag 'drm-misc-next-2025-01-06' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next

2025-01-09T05:48:50+00:00

drm-misc-next for 6.14:

UAPI Changes:
- Clarify drm memory stats documentation

Cross-subsystem Changes:

Core Changes:
 - sched: Documentation fixes,

Driver Changes:
 - amdgpu: Track BO memory stats at runtime
 - amdxdna: Various fixes
 - hisilicon: New HIBMC driver
 - bridges:
   - Provide default implementation of atomic_check for HDMI bridges
   - it605: HDCP improvements, MCCS Support

Signed-off-by: Dave Airlie 

From: Maxime Ripard 
Link: https://patchwork.freedesktop.org/patch/msgid/20250106-augmented-kakapo-of-action-0cf000@houat

drm/amdgpu: track bo memory stats at runtime

2024-12-19T15:56:28+00:00

Before, every time fdinfo is queried we try to lock all the BOs in the
VM and calculate memory usage from scratch. This works okay if the
fdinfo is rarely read and the VMs don't have a ton of BOs. If either of
these conditions is not true, we get a massive performance hit.

In this new revision, we track the BOs as they change states. This way
when the fdinfo is queried we only need to take the status lock and copy
out the usage stats with minimal impact to the runtime performance. With
this new approach however, we would no longer be able to track active
buffers.

Signed-off-by: Yunxiang Li 
Reviewed-by: Christian König 
Link: https://patchwork.freedesktop.org/patch/msgid/20241219151411.1150-6-Yunxiang.Li@amd.com
Signed-off-by: Christian König