linux-stable.git/drivers/gpu/drm/amd/amdkfd, branch v6.14

drm/amdkfd: Fix user queue validation on Gfx7/8

2025-03-18T20:31:25+00:00

To workaround queue full h/w issue on Gfx7/8, when application create
AQL queue, the ring buffer bo allocate size is queue_size/2 and
map queue_size ring buffer to GPU in 2 pieces using 2 attachments, each
attachment map size is queue_size/2, with same ring_bo backing memory.

For Gfx7/8, user queue buffer validation should use queue_size/2 to
verify ring_bo allocation and mapping size.

Fixes: 68e599db7a54 ("drm/amdkfd: Validate user queue buffers")
Suggested-by: Tomáš Trnka 
Signed-off-by: Philip Yang 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit e7a477735f1771b9a9346a5fbd09d7ff0641723a)
Cc: stable@vger.kernel.org

drm/amdgpu: Restore uncached behaviour on GFX12

2025-03-18T20:29:52+00:00

Always use MTYPE_UC if UNCACHED flag is specified.

This makes kernarg region uncached and it restores
usermode cache disable debug flag functionality.

Do not set MTYPE_UC for COHERENT flag, on GFX12 coherence is handled by
shader code.

Signed-off-by: David Belanger 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher 
(cherry picked from commit eb6cdfb807d038d9b9986b5c87188f28a4071eae)
Cc: stable@vger.kernel.org # 6.12.x

drm/amdkfd: Fix instruction hazard in gfx12 trap handler

2025-03-18T20:28:34+00:00

VALU instructions with SGPR source need wait states to avoid hazard
with SALU using different SGPR.

v2: Eliminate some hazards to reduce code explosion

Signed-off-by: Jay Cornwall 
Reviewed-by: Lancelot Six 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit 7e0459d453b911435673edd7a86eadc600c63238)
Cc: stable@vger.kernel.org # 6.12.x

drm/amd/amdkfd: Evict all queues even HWS remove queue failed

2025-03-12T18:59:21+00:00

[Why]
If reset is detected and kfd need to evict working queues, HWS moving queue will be failed.
Then remaining queues are not evicted and in active state.

After reset done, kfd uses HWS to termination remaining activated queues but HWS is resetted.
So remove queue will be failed again.

[How]
Keep removing all queues even if HWS returns failed.
It will not affect cpsch as it checks reset_domain->sem.

v2: If any queue failed, evict queue returns error.
v3: Declare err inside the if-block.

Reviewed-by: Felix Kuehling 
Signed-off-by: Yifan Zha 
Signed-off-by: Alex Deucher 
(cherry picked from commit 42c854b8fb0cce512534aa2b7141948e80c6ebb0)
Cc: stable@vger.kernel.org

drm/amdkfd: Fix NULL Pointer Dereference in KFD queue

2025-03-05T16:47:00+00:00

Through KFD IOCTL Fuzzing we encountered a NULL pointer derefrence
when calling kfd_queue_acquire_buffers.

Fixes: 629568d25fea ("drm/amdkfd: Validate queue cwsr area and eop buffer size")
Signed-off-by: Andrew Martin 
Reviewed-by: Philip Yang 
Signed-off-by: Andrew Martin 
Signed-off-by: Alex Deucher 
(cherry picked from commit 049e5bf3c8406f87c3d8e1958e0a16804fa1d530)
Cc: stable@vger.kernel.org

drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd

2025-02-25T17:19:25+00:00

When userspace applications call AMDKFD_IOC_UPDATE_QUEUE. Preserve
bitfields that do not need to be modified as they contain flags to
track queue states that are used by CP FW.

Signed-off-by: David Yat Sin 
Reviewed-by: Jay Cornwall 
Signed-off-by: Alex Deucher 
(cherry picked from commit 8150827990b709ab5a40c46c30d21b7f7b9e9440)
Cc: stable@vger.kernel.org

drm/amdkfd: Ensure consistent barrier state saved in gfx12 trap handler

2025-02-13T00:47:15+00:00

It is possible for some waves in a workgroup to finish their save
sequence before the group leader has had time to capture the workgroup
barrier state.  When this happens, having those waves exit do impact the
barrier state.  As a consequence, the state captured by the group leader
is invalid, and is eventually incorrectly restored.

This patch proposes to have all waves in a workgroup wait for each other
at the end of their save sequence (just before calling s_endpgm_saved).

Signed-off-by: Lancelot SIX 
Reviewed-by: Jay Cornwall 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org # 6.12.x

amdkfd: properly free gang_ctx_bo when failed to init user queue

2025-02-13T00:47:15+00:00

The destructor of a gtt bo is declared as
void amdgpu_amdkfd_free_gtt_mem(struct amdgpu_device *adev, void **mem_obj);
Which takes void** as the second parameter.

GCC allows passing void* to the function because void* can be implicitly
casted to any other types, so it can pass compiling.

However, passing this void* parameter into the function's
execution process(which expects void** and dereferencing void**)
will result in errors.

Signed-off-by: Zhu Lingshan 
Reviewed-by: Felix Kuehling 
Fixes: fb91065851cd ("drm/amdkfd: Refactor queue wptr_bo GART mapping")
Signed-off-by: Alex Deucher

drm/amdkfd: only flush the validate MES contex

2025-01-28T21:24:39+00:00

The following page fault was observed duringthe KFD process release.
In this particular error case, the HIP test (./MemcpyPerformance -h)
does not require the queue. As a result, the process_context_addr was
not assigned when the KFD process was released, ultimately leading to
this page fault during the execution of the function
kfd_process_dequeue_from_all_devices().

[345962.294891] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
[345962.295333] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[345962.295775] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B33
[345962.296097] amdgpu 0000:03:00.0: amdgpu:     Faulty UTCL2 client ID: CPC (0x5)
[345962.296394] amdgpu 0000:03:00.0: amdgpu:     MORE_FAULTS: 0x1
[345962.296633] amdgpu 0000:03:00.0: amdgpu:     WALKER_ERROR: 0x1
[345962.296876] amdgpu 0000:03:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[345962.297135] amdgpu 0000:03:00.0: amdgpu:     MAPPING_ERROR: 0x1
[345962.297377] amdgpu 0000:03:00.0: amdgpu:     RW: 0x0
[345962.297682] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)

Signed-off-by: Prike Liang 
Reviewed-by: Jonathan Kim 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org

drm/amdkfd: Block per-queue reset when halt_if_hws_hang=1

2025-01-28T21:22:02+00:00

The purpose of halt_if_hws_hang is to preserve GPU state for driver
debugging when queue preemption fails. Issuing per-queue reset may
kill wavefronts which caused the preemption failure.

Signed-off-by: Jay Cornwall 
Reviewed-by: Jonathan Kim 
Signed-off-by: Alex Deucher 
Cc: stable@vger.kernel.org # 6.12.x