linux.git/drivers/gpu/drm/amd/amdkfd, branch v7.1-rc7

drm/amdkfd: Fix buffer overflow in SDMA queue checkpoint/restore on GFX11

2026-06-03T18:54:46+00:00

The v11 MQD manager incorrectly assigned the CP-compute variants of
checkpoint_mqd/restore_mqd for KFD_MQD_TYPE_SDMA queues. These functions
use sizeof(struct v11_compute_mqd) (2048 bytes) instead of sizeof(struct
v11_sdma_mqd) (512 bytes), causing a 1536-byte overflow.

During CRIU checkpoint of an SDMA queue on Navi3x:
- checkpoint_mqd() reads 2048 bytes from a 512-byte SDMA MQD buffer,
  leaking 1536 bytes of adjacent GTT memory to userspace

During CRIU restore:
- restore_mqd() writes 2048 bytes into a 512-byte SDMA MQD buffer,
  corrupting 1536 bytes of adjacent GTT memory (often the ring buffer
  or neighboring MQDs)

This is a copy-paste regression unique to v11. All other ASIC backends
(cik, vi, v9, v10, v12) correctly use the SDMA-specific variants.

Add checkpoint_mqd_sdma() and restore_mqd_sdma() functions that properly
handle the smaller v11_sdma_mqd structure, matching the pattern used in
other MQD managers.
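
The size mismatch can be illustrated with a minimal userspace sketch. The
struct names, the demo_ prefix, and the helper signatures below are
hypothetical stand-ins rather than the kernel code; only the 2048-byte and
512-byte sizes come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sizes quoted in the commit message; the real structs live in the kernel. */
enum { DEMO_COMPUTE_MQD_SIZE = 2048, DEMO_SDMA_MQD_SIZE = 512 };

/* Hypothetical stand-ins for v11_compute_mqd and v11_sdma_mqd. */
struct demo_compute_mqd { unsigned char raw[DEMO_COMPUTE_MQD_SIZE]; };
struct demo_sdma_mqd    { unsigned char raw[DEMO_SDMA_MQD_SIZE]; };

/* Bytes a compute-sized copy touches beyond the end of an SDMA MQD. */
static size_t demo_overflow_bytes(void)
{
	return sizeof(struct demo_compute_mqd) - sizeof(struct demo_sdma_mqd);
}

/*
 * Hypothetical SDMA checkpoint helper: copies exactly one SDMA-sized MQD
 * and refuses destinations that are too small. Returns bytes copied.
 */
static size_t demo_checkpoint_mqd_sdma(const struct demo_sdma_mqd *mqd,
				       void *dst, size_t dst_len)
{
	assert(mqd && dst);
	if (dst_len < sizeof(*mqd))
		return 0;
	memcpy(dst, mqd, sizeof(*mqd));
	return sizeof(*mqd);
}
```

The point of the sketch is that the copy length is derived from the SDMA
structure itself, so it cannot drift to the compute MQD size.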

Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
Assisted-by: Claude:Sonnet 4-5
Signed-off-by: Andrew Martin 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit 6fa41db7ffdec97d62433adf03b7b9b759af8c2c)
Cc: stable@vger.kernel.org

drm/amdkfd: fix NULL dereference in get_queue_ids()

2026-06-03T18:54:28+00:00

When usr_queue_id_array is NULL and num_queues is non-zero,
get_queue_ids() returns NULL. The callers check only IS_ERR() on the
return value; since IS_ERR(NULL) == false the check passes, and
suspend_queues() calls q_array_invalidate() which immediately
dereferences NULL while iterating num_queues times.

Userspace can trigger this via kfd_ioctl_set_debug_trap() by supplying
num_queues > 0 with a zero queue_array_ptr, causing a kernel panic.

A NULL usr_queue_id_array with num_queues == 0 is a legitimate no-op
(q_array_invalidate never executes, and resume_queues already guards
all queue_ids dereferences behind a NULL check). Return ERR_PTR(-EINVAL)
only when num_queues is non-zero and the pointer is absent; both callers
already propagate IS_ERR() returns correctly to userspace.
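
A userspace sketch of the return convention described above, assuming
simplified stand-ins for ERR_PTR()/IS_ERR()/PTR_ERR(). All demo_ names are
hypothetical, and memcpy() stands in for the copy from userspace; this is
not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the kernel's error-pointer helpers. */
#define DEMO_MAX_ERRNO 4095

static void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/*
 * NULL array with zero queues: legitimate no-op, return NULL.
 * NULL array with non-zero queues: return an error pointer so that
 * an IS_ERR()-style check in the caller actually catches it.
 */
static uint32_t *demo_get_queue_ids(uint32_t num_queues,
				    const uint32_t *usr_queue_id_array)
{
	uint32_t *ids;

	if (!usr_queue_id_array)
		return num_queues ? demo_err_ptr(-EINVAL) : NULL;

	/* calloc() checks the multiplication; never request zero elements. */
	ids = calloc(num_queues ? num_queues : 1, sizeof(*ids));
	if (!ids)
		return demo_err_ptr(-ENOMEM);

	memcpy(ids, usr_queue_id_array, (size_t)num_queues * sizeof(*ids));
	return ids;
}
```

A caller that tests only demo_is_err() is then safe for the
NULL-with-non-zero-count case, because that case no longer yields NULL.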

Fixes: a70a93fa568b ("drm/amdkfd: add debug suspend and resume process queues operation")
Signed-off-by: Muhammad Bilal 
Signed-off-by: Alex Deucher 
(cherry picked from commit f165a82cdf503884bb1797771c61b2fcc72113d4)
Cc: stable@vger.kernel.org

drm/amdkfd: fix UAF race in destroy_queue_cpsch

2026-06-03T18:46:55+00:00

wait_on_destroy_queue() drops locks to wait for queue resume, allowing
a concurrent destroy to free the queue. Use the is_being_destroyed flag
to serialize destruction.

Reviewed-by: Amir Shetaia 
Signed-off-by: Alysa Liu 
Signed-off-by: Alex Deucher 
(cherry picked from commit ac081deaf16a639ea7dff2f285fe421a33c1ade0)

drm/amdkfd: fix a vulnerability of integer overflow in kfd debugger

2026-05-27T16:01:13+00:00

get_queue_ids() computes array_size = num_queues * sizeof(uint32_t),
which can overflow on builds with a 32-bit size_t. Use array_size()
instead, which saturates to SIZE_MAX on overflow.
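
The difference between a wrapping and a saturating size computation can be
sketched in portable C. demo_array_size() below is a hypothetical userspace
stand-in that mimics the saturating behaviour described above; it is not
the kernel helper itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain multiplication: silently wraps modulo SIZE_MAX + 1. */
static size_t demo_wrapping_size(size_t n, size_t elem)
{
	return n * elem;
}

/* Saturating multiplication: returns SIZE_MAX instead of wrapping. */
static size_t demo_array_size(size_t n, size_t elem)
{
	if (elem != 0 && n > SIZE_MAX / elem)
		return SIZE_MAX;
	return n * elem;
}
```

A saturated SIZE_MAX request is expected to fail at allocation time, whereas
a wrapped value can yield a buffer far smaller than num_queues entries.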

Signed-off-by: Eric Huang 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit 2d57a0475f085c08b49312dfd8edcb461845f285)
Cc: stable@vger.kernel.org

drm/amdkfd: Check for pdd drm file first in CRIU restore path

2026-05-27T15:59:24+00:00

CRIU restore ioctls are meant to be called by CRIU with no
existing drm file. There is an error path for the case where
the drm file unexpectedly exists, but it was positioned such
that it skipped the fput(drm_file).

Do that check earlier, as soon as we have the pdd.

Signed-off-by: David Francis 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit 2bab781dac78916c5cc8de76345a4102449267d7)
Cc: stable@vger.kernel.org

drm/amdkfd: fix NULL pointer bug in svm_range_set_attr

2026-05-27T15:57:49+00:00

The process_info can be NULL if userspace doesn't call
kfd_ioctl_acquire_vm before calling kfd_ioctl_svm.

Signed-off-by: Eric Huang 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit 83a26c812e0529eb040d31a76f73e33e637243d4)
Cc: stable@vger.kernel.org

drm/amdgpu: unmap all user mappings of framebuffer and doorbell before mode1 reset

2026-05-19T16:14:55+00:00

During Mode 1 reset, the ASIC undergoes a reset cycle and becomes temporarily
inaccessible via PCIe. Any attempt to access framebuffer or MMIO registers during
this window can result in uncompleted PCIe transactions, leading to NMI panics or
system hangs.

To prevent this, unmap all of the applications' mappings of the framebuffer
and doorbell BARs before mode1 reset. Also prevent new mappings from coming in
during the reset process.

v2: remove inode in kfd_dev (Christian)
v3: correct unmap offset (Felix), remove prevent new mappings part
to avoid deadlock (Christian)

Reviewed-by: Felix Kuehling 
Signed-off-by: Yifan Zhang 
Signed-off-by: Alex Deucher 
(cherry picked from commit 70cadefcc6160c575b04f763ada34c20e868d577)

drm/amdkfd: Check bounds for allocate_sdma_queue restore_sdma_id

2026-05-19T16:11:43+00:00

allocate_sdma_queue has an option where the sdma queue id can be
specified (used by CRIU). We weren't bounds-checking that
value.

Confirm it's less than the maximum number of queues.

Signed-off-by: David Francis 
Reviewed-by: Harish Kasiviswanathan 
Signed-off-by: Alex Deucher 
(cherry picked from commit bfe9a7545b2a7be1c543f1741e16f2d5ec4116ae)

drm/amdkfd: Check bounds on allocate_doorbell

2026-05-19T16:11:26+00:00

allocate_doorbell has an option to set the doorbell id
to a specific value (used by CRIU). This value was not
bounds checked.

Check to confirm it's less than KFD_MAX_NUM_OF_QUEUES_PER_PROCESS.
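
A minimal sketch of such a check, assuming the restore id arrives through an
optional pointer. The helper name, its signature, and the limit value of
1024 are illustrative assumptions, not taken from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value, standing in for KFD_MAX_NUM_OF_QUEUES_PER_PROCESS. */
#define DEMO_MAX_QUEUES_PER_PROCESS 1024u

/*
 * Hypothetical validation helper: a NULL restore_id means "pick any free
 * id", so only a caller-supplied id needs to be range-checked.
 */
static int demo_check_restore_id(const uint32_t *restore_id, uint32_t limit)
{
	if (restore_id && *restore_id >= limit)
		return -EINVAL;
	return 0;
}
```

The comparison is >= rather than > because valid ids run from 0 to
limit - 1.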

Signed-off-by: David Francis 
Reviewed-by: Harish Kasiviswanathan 
Signed-off-by: Alex Deucher 
(cherry picked from commit 1f087bb8cf9e8797633da35c85435e557ef74d06)

drm/amdkfd: Fix OOB memory exposure in get_wave_state()

2026-05-19T16:10:04+00:00

The get_wave_state() function for v9 trusts cp_hqd_cntl_stack_size and
cp_hqd_cntl_stack_offset values read directly from the MQD, which are
written by GPU microcode and fully attacker-controlled on the
CRIU-restore path (via AMDKFD_IOC_RESTORE_PROCESS with H3).

This leads to an unbounded copy_to_user() that can leak adjacent
GTT/kernel memory. If offset > size, integer underflow produces a ~4 GiB
read length; if size is set to 1 MiB against a 4 KiB allocation, we leak
1 MiB of adjacent kernel memory (other queues' MQDs, ring buffers, KASLR
pointers).

Fix by clamping both cp_hqd_cntl_stack_size to the actual allocated
buffer size (q->ctl_stack_size) and cp_hqd_cntl_stack_offset to the
clamped size before performing arithmetic and copy_to_user().

This ensures we never read beyond the allocated kernel BO regardless of
attacker-supplied MQD field values.
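
The clamping order can be sketched as follows. The function names are
hypothetical and the parameters merely mirror the MQD fields and
q->ctl_stack_size named above; this is an illustration of the arithmetic,
not the driver code.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t demo_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Unclamped length: underflows when offset > size. */
static uint32_t demo_unclamped_len(uint32_t mqd_size, uint32_t mqd_offset)
{
	return mqd_size - mqd_offset;
}

/*
 * Clamp size to the real allocation first, then clamp offset to the
 * clamped size, so the subtraction can neither underflow nor exceed
 * the allocated buffer.
 */
static uint32_t demo_clamped_len(uint32_t mqd_size, uint32_t mqd_offset,
				 uint32_t alloc_size)
{
	uint32_t size = demo_min_u32(mqd_size, alloc_size);
	uint32_t offset = demo_min_u32(mqd_offset, size);

	assert(offset <= size && size <= alloc_size);
	return size - offset;
}
```

With a 4 KiB allocation, a 1 MiB size field is reduced to 4096 bytes, and an
offset larger than the size yields a zero-length copy instead of a ~4 GiB
one.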

Signed-off-by: Sunday Clement 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit 7ef144458f48d5589e36f1b3d83e83db2e5c5ba5)