summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
7 daysdrm/amdgpu: fix calling VM invalidation in amdgpu_hmm_invalidate_gfxChristian König
Otherwise we don't invalidate page tables on next CS. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Tested-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b6444d1bcbc34f6f2a31a3aab3059be082f3683e) Cc: stable@vger.kernel.org
7 daysdrm/amdgpu: fix amdgpu_hmm_range_get_pagesChristian König
The notifier sequence must only be read once or otherwise we could work with invalid pages. While at it also fix the coding style, e.g. drop the pre-initialized return value and use the common define for 2G range. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Tested-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c08972f555945cda57b0adb72272a37910153390) Cc: stable@vger.kernel.org
7 daysdrm/amdgpu/userq: use array instead of list for userq_vasSunil Khatri
Use arrays instead of list for userq_vas since we have fixed no of bos. Also, we dont have to worry to free that memory later since this array would be free along with queue only. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ef7dc711a664b0c548ecfdf13a00436b7446b8e7)
7 daysdrm/amdgpu/userq: move mqd_destroy to later stage to keep core obj validSunil Khatri
mqd_destroy cleans up queue core objects like mqd and fw_object which are needed for any pending fence to signal properly. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 4ad65d610096498c8e265615aba42b3c47441bb5)
7 daysdrm/amdgpu/userq: remove amdgpu_userq_create/destroy_object wrapperSunil Khatri
Remove the amdgpu_userq_create/destroy_object wrappers and use directly the kernel bo allocation function which does all the things which are done in wrapper. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit deb02080ca5d3f015cf71e56067a39ef2f141998)
7 daysdrm/amdgpu: fix potential overflow in fs_info.debugfs_nameStanley.Yang
Use snprintf() with sizeof(fs_info.debugfs_name) so a long RAS block name plus the "_err_inject" suffix cannot overflow the 32-byte buffer. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1a58070fda26857a8f6acc0ab05428e60d5c6844)
7 daysdrm/amdgpu/userq: make sure queue is valid in the hang_detect_workSunil Khatri
Thread 1: Running amdgpu_userq_destroy which eventually remove the queue from door bell and set userq_mgr = NULL. Thread2: An interrupt might have scheduled the hang_detect_work which still need userq_mgr to be valid but could get an NULL ptrs. To fix that make sure we cancel the hang_detect_work again before setting userq_mgr to NULL. Along with that we also need all the queue va to remain valid till we could be running anything on the queue and hence moving the userq_va post hang_detect handler is cancelled. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1a66ceb98b137d18d303b9889f0e7d8c4db73943)
7 daysdrm/amdgpu/userq: reserve root bo without interruptionSunil Khatri
Fix the code to make it an uninterruptible reservation for root bo. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit d409ab4e387d94b2e593d558b54b7bfd315e0e75)
7 daysdrm/amdgpu/userq: add amdgpu_bo_unpin when amdgpu_ttm_alloc_gart failsSunil Khatri
Unpin the wptr_obj->obj when amdgpu_ttm_alloc_gart fails. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit d8145c437ccdc2d91c579787290f82788172bea0)
7 daysdrm/amdgpu: simplify return value in amdgpu_userq_get_doorbell_indexSunil Khatri
amdgpu_userq_get_doorbell_index returns a uint64 type index as well as a int type failure values. Simplifying this and using a int type return value and getting the index in input pointer of type uint64 type. Also since it's used at once place making it static would be better. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e947ec9d0529d5f93dbdb33cd197347f6a7b2922)
7 daysdrm/amdgpu/userq: Fix the mutex_init cleanup for fence_drv_lockSunil Khatri
mutex fence_drv_lock is destroyed in amdgpu_userq_fence_driver_free also in one of the jump condition mutex_destroy is also called leading to double mutex_destroy. So rearranging the code so amdgpu_userq_fence_driver_free takes care of the clean up along with mutex_destroy. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 384dbef269d101e5b671fc7b942c56734cd1d186)
7 daysdrm/amdgpu/userq: Fix doorbell object cleanup of queueSunil Khatri
Unpin and unref the door bell obj if queue creation fails before initialization is complete. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8c7506f7ba945f21e5abe7f8eac0a3acca6b5330)
7 daysdrm/amdgpu: check num_entries in GEM_OP GET_MAPPING_INFOZiyi Guo
kvcalloc(args->num_entries, sizeof(*vm_entries), GFP_KERNEL) at amdgpu_gem.c:1050 uses the user-supplied num_entries directly without any upper bounds check. Since num_entries is a __u32 and sizeof(drm_amdgpu_gem_vm_entry) is 32 bytes, a large num_entries produces an allocation exceeding INT_MAX, triggering WARNING in __kvmalloc_node_noprof(), causing a kernel WARNING, TAINT_WARN, and panic on CONFIG_PANIC_ON_WARN=y systems. Add a size bounds check before we invoke the kvzalloc() to reject oversized num_entries early with -EINVAL. Fixes: 4d82724f7f2b ("drm/amdgpu: Add mapping info option for GEM_OP ioctl") Signed-off-by: Ziyi Guo <n7l8m4@u.northwestern.edu> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1fe7bf5457f6efd7be60b17e23163ba54341d73d) Cc: stable@vger.kernel.org
7 daysdrm/amdgpu: fix lock leak on ENOMEM in AMDGPU_GEM_OP_GET_MAPPING_INFOMichael Bommarito
The AMDGPU_GEM_OP_GET_MAPPING_INFO branch of amdgpu_gem_op_ioctl() holds three cleanup-tracked resources before calling kvcalloc(): the drm_gem_object reference from drm_gem_object_lookup(), the drm_exec lock on the looked-up GEM via drm_exec_lock_obj(), and the drm_exec lock on the per-process VM root page directory via amdgpu_vm_lock_pd(). All three are released by the out_exec label that every other error path in this function jumps to. The kvcalloc() failure path returns -ENOMEM directly, skipping out_exec and leaking all three. The leaked per-process VM root PD dma_resv lock is the load-bearing leak: any subsequent operation on the same VM (further GEM ops, command-submission, eviction, TTM shrinker callbacks) blocks on the held lock. DRM_IOCTL_AMDGPU_GEM_OP is DRM_AUTH | DRM_RENDER_ALLOW, so this is an unprivileged-local denial of service against the caller's GPU context, reachable by any process with /dev/dri/renderD* access. Route the failure through out_exec so drm_exec_fini() and drm_gem_object_put() run. Reproduced on stock 7.0.0-10, Ryzen 7 5700U / Radeon Vega (Lucienne): the failing ioctl returns -ENOMEM and a second GET_MAPPING_INFO on the same fd then blocks in drm_exec_lock_obj() on the leaked dma_resv. SIGKILL on the caller does not reap the task; the fd-release path during process exit goes through amdgpu_gem_object_close() -> drm_exec_prepare_obj() on the same lock, leaving the task in D state until the box is rebooted. The patched kernel was not rebuilt and re-tested on this hardware; the fix is mechanical. Tested on a single Lucienne / Vega box only. Ziyi Guo posted an independent INT_MAX-bound check for args->num_entries in the same branch [1]; the two patches are complementary and can land in either order. Fixes: 4d82724f7f2b ("drm/amdgpu: Add mapping info option for GEM_OP ioctl") Link: https://lore.kernel.org/all/20260208000255.4073363-1-n7l8m4@u.northwestern.edu/ # [1] Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b69d3256d79de15f54c322986ff4da68f1d65b0a) Cc: stable@vger.kernel.org
2026-05-19drm/amdgpu: fix handling in amdgpu_userq_createChristian König
Well mostly the same issues the other code had as well: 1. Memory allocation while holding the userq_mutex lock is forbidden! 2. Things were created/started/published in the wrong order. 3. The reset lock was taken in the wrong order and seems to be unecessary in the first place. 4. Error messages on invalid input parameters can spam the logs. 5. Error messages on memory allocation failures are usually superflous as well. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 89e50de5654dbe7a137e03d78629542e17ba7202)
2026-05-19drm/amdgpu: userq_va_mapped should remain true once doneSunil Khatri
Multiple queues needs these bo_va objects belonging to the same uq_mgr. So once they are mapped lets not unmap them as at any point of time any of the queues might be using it. Also userq_va_mapped should be a boolean than atomic. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5c02889ea22575c3bcfdf212e65fac316cbc6c6a)
2026-05-19drm/amdgpu: avoid integer overflow in VA range checkCe Sun
The original addition operation in 64-bit unsigned type may encounter overflow situations. To prevent such issues and safely reject invalid inputs, the check_add_overflow() function is used. Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit cc768f4dd0bb9083c813683eeec44fc23921f771)
2026-05-19drm/amd/ras: Fix UMC error address allocation leakXiang Liu
amdgpu_umc_handle_bad_pages() allocates err_data->err_addr before querying UMC error information. In the direct and firmware query paths, the pointer is reassigned to a fresh allocation before the original buffer is released, so the initial allocation is leaked on each handled event. Free the existing buffer before replacing it in those query paths so the function exit cleanup only owns the active allocation. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 911b1bdd22c3712a22b60fcc58f7b9f2d07b0803)
2026-05-19drm/amdgpu: unmap all user mappings of framebuffer and doorbell before mode1 ↵Yifan Zhang
reset During Mode 1 reset, the ASIC undergoes a reset cycle and becomes temporarily inaccessible via PCIe. Any attempt to access framebuffer or MMIO registers during this window can result in uncompleted PCIe transactions, leading to NMI panics or system hangs. To prevent this, Unmap all of the applications mappings of the framebuffer and doorbell BARs before mode1 reset. Also prevent new mappings from coming in during the reset process. v2: remove inode in kfd_dev (Christian) v3: correct unmap offset (Felix), remove prevent new mappings part to avoid deadlock (Christian) Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 70cadefcc6160c575b04f763ada34c20e868d577)
2026-05-19drm/amdgpu: use atomic operation to achieve lockless serializationSunil Khatri
In amdgpu_seq64_alloc there is a possibility that two difference cores from two separate NODES can try to and could get the same free slot. So this fixes that race here using atomic test_and_set clear operations. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 4d50a14d346141e03a7c3905e496d91e048bc30c)
2026-05-19drm/amdgpu/vce3: Fix VCE 3 firmware size and offsetsTimur Kristóf
The VCPU BO contains the actual FW at an offset, but it was not calculated into the VCPU BO size. Subtract this from the FW size to make sure there is no out of bounds access. This may fix VM faults when using VCE 3. Cc: John Olender <john.olender@gmail.com> Fixes: e98226221467 ("drm/amdgpu: recalculate VCE firmware BO size") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 15c369257bd85f47a514744f960c5a51c867716f)
2026-05-19drm/amdgpu/vce2: Fix VCE 2 firmware size and offsetsTimur Kristóf
The VCPU BO contains the actual FW at an offset, but it was not calculated into the VCPU BO size. Subtract this from the FW size to make sure there is no out of bounds access. Additionally, increase the VCE_V2_0_DATA_SIZE to have extra space after the VCE handles. Also increase the data size used for each VCE handle. The FW needs 23744 bytes, use 24K to be safe. This fixes VM faults when using VCE 2. Cc: John Olender <john.olender@gmail.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4802 Fixes: e98226221467 ("drm/amdgpu: recalculate VCE firmware BO size") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a20d21df625548c1738c0745f753c5d6eb823bc3)
2026-05-19drm/amdgpu/vce1: Stop using amdgpu_vce_resumeTimur Kristóf
The VCE1 firmware works slightly differently and is already loaded by vce_v1_0_load_fw(). It doesn't actually need to call amdgpu_vce_resume(). Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 33d8951405e2dd81ac61edebc680e2dfb6b4fc9f)
2026-05-19drm/amdgpu/vce1: Fix VCE 1 firmware size and offsetsTimur Kristóf
The VCPU BO contains the actual FW at an offset, but it was not calculated into the VCPU BO size. Subtract this from the FW size to make sure there is no out of bounds access. Make sure the stack and data offsets are aligned to the 32K TLB size. Check that the FW microcode actually fits in the space that is reserved for it. Fixes: d4a640d4b9f3 ("drm/amdgpu/vce1: Implement VCE1 IP block (v2)") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c16fe59f622a080fc457a57b3e8f14c780699449)
2026-05-19drm/amdgpu/vce1: Don't repeat GTT MGR node allocationTimur Kristóf
Only allocate entries from the GTT manager when the VCE GTT node is not allocated yet. This prevents the possibility of allocating them multiple times, which causes issues during GPU reset and suspend/resume. Fixes: 71aec08f80e7 ("amdgpu/vce: use amdgpu_gtt_mgr_alloc_entries") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8d2a20c1721cb17e22821e1b4ecbb02d475d91c5)
2026-05-19drm/amdgpu/vce1: Check if VRAM address is lower than GART.Timur Kristóf
Previously, I had assumed this was not possible so it was OK to not handle it, but now we got a report from a user who has a board that is configured this way. When the VCPU BO is already located in a low 32-bit address in VRAM (eg. when VRAM is mapped to the low address space), don't do the workaround. Fixes: 71aec08f80e7 ("amdgpu/vce: use amdgpu_gtt_mgr_alloc_entries") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit f370ec9b164698a9ca1a7b59bfbea07f70df769d)
2026-05-19drm/amdgpu/vce1: Remove superfluous address checkTimur Kristóf
The same thing is already checked a few lines above. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c1dc555e760dbfc4a4710f7270f525a03d433af8)
2026-05-19drm/amdgpu/vce1: Check that the GPU address is < 128 MiBTimur Kristóf
When ensuring the low 32-bit address, make sure it is less than 128 MiB, otherwise the VCE seems to fail to initialize. This seems to be an undocumented limitation of the firmware validation mechanism. Note that in case of VCE1 the BAR address is zero and we can't change it also due to the firmware validator. When programming the mmVCE_VCPU_CACHE_OFFSETn registers, don't AND them with a mask. This is incorrect because the register mask is actually 0x0fffffff and useless because we already ensure the addresses are below the limit. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e729ae5f3ac73c861c062080ac8c3d666c972404)
2026-05-19drm/amdgpu: Align amdgpu_gtt_mgr entries to TLB size on Tahiti (v2)Timur Kristóf
The TLB is organized in groups of 8 entries, each one is 4K. On Tahiti, the HW requires these GART entries to be 32K-aligned. This fixes a VCE 1 firmware validation failure that can happen after suspend/resume since we use amdgpu_gtt_mgr for VCE 1. v2: - Change variable declaration order - Add comment about "V bit HW bug" Fixes: 698fa62f56aa ("drm/amdgpu: Add helper to alloc GART entries") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 530411b465ef0b2c0cc18c2e3d7e38422b1117d1)
2026-05-19drm/amdgpu: Fix discovery offset check under VFLijo Lazar
Discovery table may be kept at offset 0 by host driver. Remove the validation check. Fixes: 01bdc7e219c4 ("drm/amdgpu: New interface to get IP discovery binary v3") Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Ellen Pan <yunru.pan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit d3f5bbd007133c64a20e81ef290a93e46c75df40)
2026-05-19drm/amdgpu: remove va cursors for all mappingsSunil Khatri
va_cursor struct needs to be cleaned even if the mapping has been removed already. Also simplify it by make it a void function as return value check isn't needed as its called during tear down. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 4d35a45c9b4c1ac5b6e3219f83c3db706b675fa2)
2026-05-19drm/amdgpu: reject non-user addresses early in GEM_USERPTR ioctlAmir Shetaia
amdgpu_gem_userptr_ioctl() currently accepts any value of args->addr and only discovers an out-of-range pointer much later, inside amdgpu_gem_object_create() and the HMM mirror registration path. Userspace can drive that path with kernel-side virtual addresses; the get_user_pages() layer rejects them, but only after the driver has already allocated a GEM object and started wiring up notifier state that then has to be torn down on failure. Add an access_ok() guard at the top of the ioctl, right after the existing page-alignment check and before flag validation, so any address that does not lie within the calling task's user address range is rejected with -EFAULT before any allocation occurs. No legitimate ROCm/HSA userspace passes kernel-mode pointers through this interface, so this is defense-in-depth rather than a behaviour change for valid callers; -EFAULT matches the convention already used by other uaccess-style rejections in the kernel. Also add an explicit #include <linux/uaccess.h>; access_ok() is otherwise only available transitively through other headers in this translation unit. Signed-off-by: Amir Shetaia <Amir.Shetaia@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 7a076df36397d780d7e4fb595287b4980451a7f5)
2026-05-19drm/amdgpu/vpe: Force collaborate sync after TRAPAlan Liu
VPE1 could possibly hang and fail to power off at the end of commands in collaboration mode. This workaround adds a COLLAB_SYNC after TRAP to force instances synchronized to avoid VPE1 fail to power off. Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alan liu <haoping.liu@amd.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5171 Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a8b749c5c5afb7e5daa2bfb95d958fb3c6b8f055) Cc: stable@vger.kernel.org
2026-05-19drm/amdgpu/userq: update the vm task info during signal ioctlSunil Khatri
Pagefaults does not have process information correctly populated as vm->task is not set during vm_init but should be updated while real submission. So setting that up during signal_ioctl to get the correct submission process details. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a9b14d88b4d83e21ab965f23d1fb7b07b87e0517)
2026-05-19drm/amdgpu/userq: cancel reset work while tear down in progressSunil Khatri
While tear down of a userq_mgr is happening when all the queues are free we should cancel any reset work if pending before exiting. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 160164609f71f774c4f661227a9b7a370a86b112)
2026-05-19drm/amdgpu: rework userq reset work handlingChristian König
It is illegal to schedule reset work from another reset work! Fix this by scheduling the userq reset work directly on the work queue of the reset domain. Not fully tested, I leave that to the IGT test cases. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit fd9200ccefab94f27877d1943761d6b0ccbd89c8)
2026-05-19drm/amdgpu/userq: pin mqd and fw object bo to avoid evictionSunil Khatri
mqd and fw objects are queue core objects which should remain valid and never be unmapped and evicted for user queues to work properly. During eviction if these buffers are evicted the hw continue to use the invalid addresses and caused page faults and system hung. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a3bbf32a336939a1d21b9561f8e53333b684b7ef)
2026-05-19drm/amdgpu/userq: use drm_exec in amdgpu_userq_fence_read_wptrSunil Khatri
To access the bo from vm mapping first lock the root bo and then the object bo of the mapping to make sure both locks are taken safely. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 3aab50410653fe7eb35eb6f9c2b27e3549ab09e6)
2026-05-11drm/amdgpu/gfx_v12_0: set gfx.rs64_enable from PFP header on GFX12Jesse Zhang
gfx_v12_0_init_microcode() always loads RS64 CP ucode but never set adev->gfx.rs64_enable, so it stayed false and code that branches on it (e.g. MEC pipe reset) used the legacy CP_MEC_CNTL path incorrectly. Match GFX11: derive RS64 mode from the PFP firmware header (v2.0) via amdgpu_ucode_hdr_version(). Log at debug when RS64 is enabled. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b03d53598b0d2048e8fa7303b8d0784768ec4fa6)
2026-05-11drm/amd/ras: Fix CPER ring debugfs read overflowXiang Liu
The legacy CPER debugfs reader can reach the payload path without a valid pointer snapshot. The remaining user byte count is also treated as the ring occupancy in dwords, so reads past the header can copy more than requested. Take the CPER lock before sampling pointers. Resample rptr/wptr for payload reads, bound the payload copy by available dwords and the remaining user size, and advance the file position for each dword copied. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1e40ef87ffdc291e05ccdade8b9170cc9c1c4249)
2026-05-11drm/amdgpu: fix userq hang detection and resetChristian König
Fix lock inversions pointed out by Prike and Sunil. The hang detection timeout *CAN'T* grab locks under which we wait for fences, especially not the userq_mutex lock. Then instead of this completely broken handling with the hang_detect_fence just cancel the work when fences are processed and re-start if necessary. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1b62077f045ac6ffde7c97005c6659569ac5c1ec)
2026-05-11drm/amdgpu: remove almost all calls to amdgpu_userq_detect_and_reset_queuesChristian König
Well the reset handling seems broken on multiple levels. As first step of fixing this remove most calls to the hang detection. That function should only be called after we run into a timeout! And *NOT* as random check spread over the code in multiple places. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 71bea36b54ccfb14cbc90f94267af6369af4e702)
2026-05-11drm/amdgpu: rework amdgpu_userq_signal_ioctl v3Christian König
This one was fortunately not looking so bad as the wait ioctl path, but there were still a few things which could be fixed/improved: 1. Allocating with GFP_ATOMIC was quite unnecessary, we can do that before taking the userq_lock. 2. Use a new mutex as protection for the fence_drv_xa so that we can do memory allocations while holding it. 3. Starting the reset timer is unnecessary when the fence is already signaled when we create it. 4. Cleanup error handling, avoid trying to free the queue when we don't even got one. v2: fix incorrect usage of xa_find, destroy the new mutex on error v3: cleanup ref ordering Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1609eb0f81a609d350169839128cecf298c84e7a)
2026-05-11drm/amdgpu: remove deadlocks from amdgpu_userq_pre_resetChristian König
The purpose of a GPU reset is to make sure that fence can be signaled again and the signal and resume workers can make progress again. So waiting for the resume worker or any fence in the GPU reset path is just utterly nonsense. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit fcd5f065eab46993af43442fd77ee8d9eb9c5bdf)
2026-05-05drm/amdgpu: nuke amdgpu_userq_fence_slab v2Christian König
As preparation for independent fences remove the extra slab, kmalloc should do just fine. v2: use GFP_KERNEL instead of GFP_ATOMIC Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 0d831487b5be0ae59cac865a0aa87b0acc3dc717)
2026-05-05drm/amdgpu/userq: fix access to stale wptr mappingSunil Khatri
Use drm_exec to take both locks i.e vm root bo and wptr_obj bo to access the mapping data properly. This fixes the security issue of unmap the wptr_obj while a queue creation is in progress and passing other bo at same address. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1fc6c8ab45dbee096469c08c13f6099d57a52d6c) Cc: stable@vger.kernel.org
2026-05-05drm/amdgpu: zero-initialize GART table on allocationPhilip Yang
GART TLB is flushed after unmapping but not after mapping. Since amdgpu_bo_create_kernel() does not zero-initialize the buffer, when a single PTE is written the TLB may speculatively load other uninitialized entries from the same cacheline. Those garbage entries can appear valid, and a subsequent write to another PTE in the same cacheline may cause the GPU to use a stale garbage PTE from the TLB. Fix this by calling memset_io() to zero-initialize the GART table with gart_pte_flags immediately after allocation. Using AMDGPU_GEM_CREATE_VRAM_CLEARED, SDMA-based clear will not work since SDMA needs GART to be initialized to work. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit d9af8263b82b6eaa60c5718e0c6631c5037e4b24) Cc: stable@vger.kernel.org
2026-05-05drm/amdgpu/sdma4: replace BUG_ON with WARN_ON in fence emissionJohn B. Moore
sdma_v4_0_ring_emit_fence() contains two BUG_ON(addr & 0x3) assertions that verify fence writeback addresses are dword-aligned. These assertions can be reached from unprivileged userspace via crafted DRM_IOCTL_AMDGPU_CS submissions, causing a fatal kernel panic in a scheduler worker thread. Replace both BUG_ON() calls with WARN_ON() to log the condition without crashing the kernel. A misaligned fence address at this point indicates a driver bug, but crashing the kernel is never the correct response when the assertion is reachable from userspace. The CS IOCTL path is the correct place to filter invalid submissions; the ring emission callback is too late to do anything about it. Fixes: 2130f89ced2c ("drm/amdgpu: add SDMA v4.0 implementation (v2)") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: John B. Moore <jbmoore61@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b90250bd933afd1ba94d86d6b13821997b22b18e) Cc: stable@vger.kernel.org
2026-05-05drm/amdgpu/gfx9: drop unnecessary 64-bit fence flag check in KIQJohn B. Moore
Remove the BUG_ON(flags & AMDGPU_FENCE_FLAG_64BIT) assertion from gfx_v9_0_ring_emit_fence_kiq(). The KIQ hardware supports 64-bit fence writes; the 32-bit writeback address constraint is an upper-layer convention, not a hardware limitation. The check serves no purpose and should not be present. Found by code inspection while investigating related BUG_ON assertions in the GFX and compute ring emission paths. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: John B. Moore <jbmoore61@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1b1101a46a426bb4328116bb5273c326a2780389) Cc: stable@vger.kernel.org
2026-05-01Merge tag 'amd-drm-fixes-7.1-2026-04-30' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-7.1-2026-04-30: amdgpu: - GFX12 fix for CONFIG_DRM_DEBUG_MM configs - Fix DC analog support - Userq fixes - GART placement fix - Aldebaran SMU fixes - AMDGPU_INFO_READ_MMR_REG fix - UVD 3.1 fix - GC 6 TCC fix - Fix root reservation in amdgpu_vm_handle_fault() - RAS fix - Module reload fix for APUs - Fix build for CONFIG_DRM_FBDEV_EMULATION=n - IGT DWB regression fix - GC 11.5.4 fix - VCN user fence fixes - JPEG user fence fixes - SMU 13.0.6 fix - VCN 3/4 IB parser fixes - NV3x+ dGPU vblank fix - DCE6/8 fixes for LVDS/eDP panels without an EDID amdkfd: - Fix for when CONFIG_HSA_AMD is not set - SVM fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260430135619.3929877-1-alexander.deucher@amd.com