linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2026-04-03	drm/amdgpu: Add function to fill fw reserve region	Lijo Lazar
	Add a function to fill in details for firmware reserve region. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Group filling reserve region details	Lijo Lazar
	Add a function which groups filling of reserve region information. It may not cover all as info on some regions are still filled outside like those from atomfirmware tables. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add memory training reserve-region	Lijo Lazar
	Use reserve region helpers for initializing/reserving memory training region. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add host driver reserved-region	Lijo Lazar
	Use reserve region helpers for initializing/reserving host driver reserved region in virtualization environment. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add fw vram usage reserve-region	Lijo Lazar
	Use reserve region helpers for initializing/reserving firmware usage region in virtualized environments. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add firmware extended reserve-region	Lijo Lazar
	Use reserve region helpers for initializing/reserving extended firmware reservation area. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add fw_reserved reserve-region	Lijo Lazar
	Use reserve region helpers for initializing/reserving fw_reserved region. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add stolen_reserved reserve-region	Lijo Lazar
	Use reserve region helpers for initializing/reserving stolen_reserved region. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add extended stolen vga reserve-region	Lijo Lazar
	Use reserve region helpers for initializing/reserving extended stolen vga region. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add stolen vga reserve-region	Lijo Lazar
	Use reserve region helpers for initializing/reserving stolen vga region. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu: Add reserved region ids	Lijo Lazar
	Add reserved regions and helper functions to memory manager. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-04-03	drm/amdgpu/vcn4: Prevent OOB reads when parsing IB	Benjamin Cheng
	Rewrite the IB parsing to use amdgpu_ib_get_value() which handles the bounds checks. Signed-off-by: Benjamin Cheng <benjamin.cheng@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Ruijing Dong <ruijing.dong@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2026-04-03	drm/amdgpu/vcn4: Prevent OOB reads when parsing dec msg	Benjamin Cheng
	Check bounds against the end of the BO whenever we access the msg. Signed-off-by: Benjamin Cheng <benjamin.cheng@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Ruijing Dong <ruijing.dong@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2026-04-03	drm/amdgpu/vcn3: Prevent OOB reads when parsing dec msg	Benjamin Cheng
	Check bounds against the end of the BO whenever we access the msg. Signed-off-by: Benjamin Cheng <benjamin.cheng@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Ruijing Dong <ruijing.dong@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2026-04-03	drm/amdgpu/vce: Prevent partial address patches	Benjamin Cheng
	In the case that only one of lo/hi is valid, the patching could result in a bad address written to in FW. Signed-off-by: Benjamin Cheng <benjamin.cheng@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2026-04-03	drm/amdgpu: Add bounds checking to ib_{get,set}_value	Benjamin Cheng
	The uvd/vce/vcn code accesses the IB at predefined offsets without checking that the IB is large enough. Check the bounds here. The caller is responsible for making sure it can handle arbitrary return values. Also make the idx a uint32_t to prevent overflows causing the condition to fail. Signed-off-by: Benjamin Cheng <benjamin.cheng@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Ruijing Dong <ruijing.dong@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2026-03-30	drm/amdgpu/uvd4.2: Don't initialize UVD 4.2 when DPM is disabled	Timur Kristóf
	UVD 4.2 doesn't work at all when DPM is disabled because the SMU is responsible for ungating it. So, Linux fails to boot with CIK GPUs when using the amdgpu.dpm=0 parameter. Fix this by returning -ENOENT from uvd_v4_2_early_init() when amdgpu_dpm isn't enabled. Note: amdgpu.dpm=0 is often suggested as a workaround for issues and is useful for debugging. Fixes: a2e73f56fa62 ("drm/amdgpu: Add support for CIK parts") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: Fix wait after reset sequence in S4	Lijo Lazar
	For a mode-1 reset done at the end of S4 on PSPv11 dGPUs, only check if TOS is unloaded. Fixes: 32f73741d6ee ("drm/amdgpu: Wait for bootloader after PSPv11 reset") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4853 Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 2fb4883b884a437d760bd7bdf7695a7e5a60bba3) Cc: stable@vger.kernel.org
2026-03-30	drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 64KB	Donet Tom
	Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with 4K pages, both values match (8KB), so allocation and reserved space are consistent. However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB, while the reserved trap area remains 8KB. This mismatch causes the kernel to crash when running rocminfo or rccl unit tests. Kernel attempted to read user page (2) - exploit attempt? (uid: 1001) BUG: Kernel NULL pointer dereference on read at 0x00000002 Faulting instruction address: 0xc0000000002c8a64 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY Tainted: [E]=UNSIGNED_MODULE Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries NIP: c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730 REGS: c0000001e0957580 TRAP: 0300 Tainted: G E MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268 XER: 00000036 CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000 IRQMASK: 1 GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100 c00000013d814540 GPR04: 0000000000000002 c00000013d814550 0000000000000045 0000000000000000 GPR08: c00000013444d000 c00000013d814538 c00000013d814538 0000000084002268 GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff 0000000000020000 GPR16: 0000000000000000 0000000000000002 c00000015f653000 0000000000000000 GPR20: c000000138662400 c00000013d814540 0000000000000000 c00000013d814500 GPR24: 0000000000000000 0000000000000002 c0000001e0957888 c0000001e0957878 GPR28: c00000013d814548 0000000000000000 c00000013d814540 c0000001e0957888 NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0 LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00 Call Trace: 0xc0000001e0957890 (unreliable) __mutex_lock.constprop.0+0x58/0xd00 amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu] kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu] kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu] kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu] kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu] kfd_ioctl+0x514/0x670 [amdgpu] sys_ioctl+0x134/0x180 system_call_exception+0x114/0x300 system_call_vectored_common+0x15c/0x2ec This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 64 KB and KFD_CWSR_TBA_TMA_SIZE to the AMD GPU page size. This means we reserve 64 KB for the trap in the address space, but only allocate 8 KB within it. With this approach, the allocation size never exceeds the reserved area. Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole") Reviewed-by: Christian König <christian.koenig@amd.com> Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 31b8de5e55666f26ea7ece5f412b83eab3f56dbb) Cc: stable@vger.kernel.org
2026-03-30	drm/amdgpu/userq: fix memory leak in MQD creation error paths	Junrui Luo
	In mes_userq_mqd_create(), the memdup_user() allocations for IP-specific MQD structs are not freed when subsequent VA validation fails. The goto free_mqd label only cleans up the MQD BO object and userq_props. Fix by adding kfree() before each goto free_mqd on VA validation failure in the COMPUTE, GFX, and SDMA branches. Fixes: 9e46b8bb0539 ("drm/amdgpu: validate userq buffer virtual address and size") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 27f5ff9e4a4150d7cf8b4085aedd3b77ddcc5d08) Cc: stable@vger.kernel.org
2026-03-30	drm/amd: Fix MQD and control stack alignment for non-4K	Donet Tom
	For gfxV9, due to a hardware bug ("based on the comments in the code here [1]"), the control stack of a user-mode compute queue must be allocated immediately after the page boundary of its regular MQD buffer. To handle this, we allocate an enlarged MQD buffer where the first page is used as the MQD and the remaining pages store the control stack. Although these regions share the same BO, they require different memory types: the MQD must be UC (uncached), while the control stack must be NC (non-coherent), matching the behavior when the control stack is allocated in user space. This logic works correctly on systems where the CPU page size matches the GPU page size (4K). However, the current implementation aligns both the MQD and the control stack to the CPU PAGE_SIZE. On systems with a larger CPU page size, the entire first CPU page is marked UC—even though that page may contain multiple GPU pages. The GPU treats the second 4K GPU page inside that CPU page as part of the control stack, but it is incorrectly mapped as UC. This patch fixes the issue by aligning both the MQD and control stack sizes to the GPU page size (4K). The first 4K page is correctly marked as UC for the MQD, and the remaining GPU pages are marked NC for the control stack. This ensures proper memory type assignment on systems with larger CPU page sizes. [1]: https://elixir.bootlin.com/linux/v6.18/source/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c#L118 Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 998d6781410de1c4b787fdbf6c56e851ea7fa553)
2026-03-30	drm/amdgpu: fix the idr allocation flags	Prike Liang
	Fix the IDR allocation flags by using atomic GFP flags in non‑sleepable contexts to avoid the __might_sleep() complaint. 268.290239] [drm] Initialized amdgpu 3.64.0 for 0000:03:00.0 on minor 0 [ 268.294900] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:323 [ 268.295355] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1744, name: modprobe [ 268.295705] preempt_count: 1, expected: 0 [ 268.295886] RCU nest depth: 0, expected: 0 [ 268.296072] 2 locks held by modprobe/1744: [ 268.296077] #0: ffff8c3a44abd1b8 (&dev->mutex){....}-{4:4}, at: __driver_attach+0xe4/0x210 [ 268.296100] #1: ffffffffc1a6ea78 (amdgpu_pasid_idr_lock){+.+.}-{3:3}, at: amdgpu_pasid_alloc+0x26/0xe0 [amdgpu] [ 268.296494] CPU: 12 UID: 0 PID: 1744 Comm: modprobe Tainted: G U OE 6.19.0-custom #16 PREEMPT(voluntary) [ 268.296498] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 268.296499] Hardware name: AMD Majolica-RN/Majolica-RN, BIOS RMJ1009A 06/13/2021 [ 268.296501] Call Trace: Fixes: 8f1de51f49be ("drm/amdgpu: prevent immediate PASID reuse case") Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ea56aa2625708eaf96f310032391ff37746310ef) Cc: stable@vger.kernel.org
2026-03-30	drm/amdgpu: validate doorbell_offset in user queue creation	Junrui Luo
	amdgpu_userq_get_doorbell_index() passes the user-provided doorbell_offset to amdgpu_doorbell_index_on_bar() without bounds checking. An arbitrarily large doorbell_offset can cause the calculated doorbell index to fall outside the allocated doorbell BO, potentially corrupting kernel doorbell space. Validate that doorbell_offset falls within the doorbell BO before computing the BAR index, using u64 arithmetic to prevent overflow. Fixes: f09c1e6077ab ("drm/amdgpu: generate doorbell index for userqueue") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit de1ef4ffd70e1d15f0bf584fd22b1f28cbd5e2ec) Cc: stable@vger.kernel.org
2026-03-30	drm/amdgpu: use multiple entities in amdgpu_move_blit	Pierre-Eric Pelloux-Prayer
	Thanks to "drm/ttm: rework pipelined eviction fence handling", ttm can deal correctly with moves and evictions being executed from different contexts. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: use TTM_NUM_MOVE_FENCES when reserving fences	Pierre-Eric Pelloux-Prayer
	Use TTM_NUM_MOVE_FENCES as an upperbound of how many fences ttm might need to deal with moves/evictions. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: round robin through clear_entities in amdgpu_fill_buffer	Pierre-Eric Pelloux-Prayer
	This makes clear of different BOs run in parallel. Partial jobs to clear a single BO still execute sequentially. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: allocate move entities dynamically	Pierre-Eric Pelloux-Prayer
	No functional change for now, as we always allocate a single entity. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: allocate clear entities dynamically	Pierre-Eric Pelloux-Prayer
	No functional change for now, as we always allocate a single entity and use it everywhere. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amd/display: Add Idle state manager(ISM)	Ray Wu
	[Why] Rapid allow/disallow of idle optimization calls, whether it be IPS or self-refresh features, can end up using more power if actual time-in-idle is low. It can also spam DMUB command submission in a way that prevents it from servicing other requestors. [How] Introduce the Idle State Manager (ISM) to amdgpu. It maintains a finite state machine that uses a hysteresis to determine if a delay should be inserted between a caller allowing idle, and when the actual idle optimizations are programmed. A second timer is also introduced to enable static screen optimizations (SSO) such as PSR1 and Replay low HZ idle mode. Rapid SSO enable/disable can have a negative power impact on some low hz video playback, and can introduce user lag for PSR1 (due to up to 3 frames of sync latency). This effectively rate-limits idle optimizations, based on hysteresis. This also replaces the existing delay logic used for PSR1, allowing drm_vblank_crtc_config.disable_immediate = true, and thus allowing drm_crtc_vblank_restore(). v2: * Loosen criteria for ISM to exit idle optimizations; it failed to exit idle correctly on cursor updates when there are no drm_vblank requestors, * Document default_ism_config * Convert pr_debug to trace events to reduce overhead on frequent codepaths * checkpatch.pl fixes Link: https://gitlab.freedesktop.org/drm/amd/-/issues/4527 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3709 Fixes: 58a261bfc967 ("drm/amd/display: use a more lax vblank enable policy for older ASICs") Signed-off-by: Ray Wu <ray.wu@amd.com> Signed-off-by: Leo Li <sunpeng.li@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu/gfx11: Add Cleaner Shader Support for GFX11.5.4	Srinivasan Shanmugam
	The Cleaner Shader is responsible for clearing LDS, VGPRs and SGPRs between GPU workloads to enforce process isolation and avoid data leakage. The cleaner shader clears per-wave GPU state (LDS, VGPRs and SGPRs) between workloads, improving process isolation and preventing stale data from being observed by subsequent tasks. This reuses the existing cleaner shader used on GFX11.0.3 and enables it for GFX11.5.4 GPUs when firmware requirements are met. Cc: Muhammad Adam <muhammad.adam@amd.com> Cc: Mario Sopena-Novales <mario.novales@amd.com> Cc: Tom Wu <Tom.Wu@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: Fix wait after reset sequence in S4	Lijo Lazar
	For a mode-1 reset done at the end of S4 on PSPv11 dGPUs, only check if TOS is unloaded. Fixes: 32f73741d6ee ("drm/amdgpu: Wait for bootloader after PSPv11 reset") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4853 Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: add support to query vram info from firmware	Gangliang Xie
	add support to query vram info from firmware v2: change APU vram type, add multi-aid check v3: seperate vram info query function into 3 parts and call them in a helper func when requirements are met. v4: calculate vram_width for v9.x Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 64KB	Donet Tom
	Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with 4K pages, both values match (8KB), so allocation and reserved space are consistent. However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB, while the reserved trap area remains 8KB. This mismatch causes the kernel to crash when running rocminfo or rccl unit tests. Kernel attempted to read user page (2) - exploit attempt? (uid: 1001) BUG: Kernel NULL pointer dereference on read at 0x00000002 Faulting instruction address: 0xc0000000002c8a64 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY Tainted: [E]=UNSIGNED_MODULE Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries NIP: c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730 REGS: c0000001e0957580 TRAP: 0300 Tainted: G E MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268 XER: 00000036 CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000 IRQMASK: 1 GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100 c00000013d814540 GPR04: 0000000000000002 c00000013d814550 0000000000000045 0000000000000000 GPR08: c00000013444d000 c00000013d814538 c00000013d814538 0000000084002268 GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff 0000000000020000 GPR16: 0000000000000000 0000000000000002 c00000015f653000 0000000000000000 GPR20: c000000138662400 c00000013d814540 0000000000000000 c00000013d814500 GPR24: 0000000000000000 0000000000000002 c0000001e0957888 c0000001e0957878 GPR28: c00000013d814548 0000000000000000 c00000013d814540 c0000001e0957888 NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0 LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00 Call Trace: 0xc0000001e0957890 (unreliable) __mutex_lock.constprop.0+0x58/0xd00 amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu] kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu] kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu] kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu] kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu] kfd_ioctl+0x514/0x670 [amdgpu] sys_ioctl+0x134/0x180 system_call_exception+0x114/0x300 system_call_vectored_common+0x15c/0x2ec This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 64 KB and KFD_CWSR_TBA_TMA_SIZE to the AMD GPU page size. This means we reserve 64 KB for the trap in the address space, but only allocate 8 KB within it. With this approach, the allocation size never exceeds the reserved area. Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole") Reviewed-by: Christian König <christian.koenig@amd.com> Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu/userq: Fix the code alignment for readability	Sunil Khatri
	Fix the code alignment for if condition and also provide a line space between multiline if condition and next statement. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: reset ras eeprom table when it is invalid	Gangliang Xie
	reset ras eeprom table when it is invalid Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu/userq: fix memory leak in MQD creation error paths	Junrui Luo
	In mes_userq_mqd_create(), the memdup_user() allocations for IP-specific MQD structs are not freed when subsequent VA validation fails. The goto free_mqd label only cleans up the MQD BO object and userq_props. Fix by adding kfree() before each goto free_mqd on VA validation failure in the COMPUTE, GFX, and SDMA branches. Fixes: 9e46b8bb0539 ("drm/amdgpu: validate userq buffer virtual address and size") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amd: Fix MQD and control stack alignment for non-4K	Donet Tom
	For gfxV9, due to a hardware bug ("based on the comments in the code here [1]"), the control stack of a user-mode compute queue must be allocated immediately after the page boundary of its regular MQD buffer. To handle this, we allocate an enlarged MQD buffer where the first page is used as the MQD and the remaining pages store the control stack. Although these regions share the same BO, they require different memory types: the MQD must be UC (uncached), while the control stack must be NC (non-coherent), matching the behavior when the control stack is allocated in user space. This logic works correctly on systems where the CPU page size matches the GPU page size (4K). However, the current implementation aligns both the MQD and the control stack to the CPU PAGE_SIZE. On systems with a larger CPU page size, the entire first CPU page is marked UC—even though that page may contain multiple GPU pages. The GPU treats the second 4K GPU page inside that CPU page as part of the control stack, but it is incorrectly mapped as UC. This patch fixes the issue by aligning both the MQD and control stack sizes to the GPU page size (4K). The first 4K page is correctly marked as UC for the MQD, and the remaining GPU pages are marked NC for the control stack. This ensures proper memory type assignment on systems with larger CPU page sizes. [1]: https://elixir.bootlin.com/linux/v6.18/source/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c#L118 Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: fix the idr allocation flags	Prike Liang
	Fix the IDR allocation flags by using atomic GFP flags in non‑sleepable contexts to avoid the __might_sleep() complaint. 268.290239] [drm] Initialized amdgpu 3.64.0 for 0000:03:00.0 on minor 0 [ 268.294900] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:323 [ 268.295355] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1744, name: modprobe [ 268.295705] preempt_count: 1, expected: 0 [ 268.295886] RCU nest depth: 0, expected: 0 [ 268.296072] 2 locks held by modprobe/1744: [ 268.296077] #0: ffff8c3a44abd1b8 (&dev->mutex){....}-{4:4}, at: __driver_attach+0xe4/0x210 [ 268.296100] #1: ffffffffc1a6ea78 (amdgpu_pasid_idr_lock){+.+.}-{3:3}, at: amdgpu_pasid_alloc+0x26/0xe0 [amdgpu] [ 268.296494] CPU: 12 UID: 0 PID: 1744 Comm: modprobe Tainted: G U OE 6.19.0-custom #16 PREEMPT(voluntary) [ 268.296498] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 268.296499] Hardware name: AMD Majolica-RN/Majolica-RN, BIOS RMJ1009A 06/13/2021 [ 268.296501] Call Trace: Fixes: 8f1de51f49be ("drm/amdgpu: prevent immediate PASID reuse case") Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: flush coredump work before HW teardown	Jesse Zhang
	In amdgpu_device_fini_hw(), deferred coredump formatting work may still be pending when hardware and IP components are being torn down. Since the work may access device registers and memory that will be freed or powered off, it must be completed before proceeding. Add a flush_work() call for adev->coredump_work, guarded by CONFIG_DEV_COREDUMP, to ensure any pending coredump work finishes before the device enters the early IP fini stage. This avoids potential use-after-free or accessing hardware resources that are no longer available. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: guard atom_context in devcoredump VBIOS dump	Jesse Zhang
	During GPU reset coredump generation, amdgpu_devcoredump_fw_info() unconditionally dereferences adev->mode_info.atom_context to print VBIOS fields. On reset/teardown paths this pointer can be NULL, causing a kernel page fault from the deferred coredump workqueue. Fix by checking ctx before printing VBIOS fields: if ctx is valid, print full VBIOS information as before; This prevents NULL-dereference crashes while preserving coredump output. Observed page fault log: [ 667.933329] RIP: 0010:amdgpu_devcoredump_format+0x780/0xc00 [amdgpu] [ 667.941517] amdgpu 0002:01:00.0: Dumping IP State [ 667.949660] Code: 8d 57 74 48 c7 c6 01 65 9f c2 48 8d 7d 98 e8 97 96 7a ff 49 8d 97 b4 00 00 00 48 c7 c6 18 65 9f c2 48 8d 7d 98 e8 80 96 7a ff <41> 8b 97 f4 00 00 00 48 c7 c6 2f 65 9f c2 48 8d 7d 98 e8 69 96 7a [ 667.949666] RSP: 0018:ffffc9002302bd50 EFLAGS: 00010246 [ 667.949673] RAX: 0000000000000000 RBX: ffff888110600000 RCX: 0000000000000000 [ 667.949676] RDX: 000000000000a9b5 RSI: 0000000000000405 RDI: 000000000000a999 [ 667.949680] RBP: ffffc9002302be00 R08: ffffffffc09c3084 R09: ffffffffc09c3085 [ 667.949684] R10: 0000000000000000 R11: 0000000000000004 R12: 00000000000048e0 [ 667.993908] amdgpu 0002:01:00.0: Dumping IP State Completed [ 667.994229] R13: 0000000000000025 R14: 000000000000000c R15: 0000000000000000 [ 667.994233] FS: 0000000000000000(0000) GS:ffff88c44c2c9000(0000) knlGS:0000000000000000 [ 668.000076] amdgpu 0002:01:00.0: [drm] AMDGPU device coredump file has been created [ 668.008025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 668.008030] CR2: 00000000000000f4 CR3: 000000011195f001 CR4: 0000000000770ef0 [ 668.008035] PKRU: 55555554 [ 668.008040] Call Trace: [ 668.008045] <TASK> [ 668.016010] amdgpu 0002:01:00.0: [drm] Check your /sys/class/drm/card16/device/devcoredump/data [ 668.023967] ? srso_alias_return_thunk+0x5/0xfbef5 [ 668.023988] ? __pfx___drm_printfn_coredump+0x10/0x10 [drm] [ 668.031950] amdgpu 0003:01:00.0: Dumping IP State [ 668.038159] ? __pfx___drm_puts_coredump+0x10/0x10 [drm] [ 668.083017] amdgpu 0003:01:00.0: Dumping IP State Completed [ 668.083824] amdgpu_devcoredump_deferred_work+0x26/0xc0 [amdgpu] [ 668.086163] amdgpu 0003:01:00.0: [drm] AMDGPU device coredump file has been created [ 668.095863] process_scheduled_works+0xa6/0x420 [ 668.095880] worker_thread+0x12a/0x270 [ 668.101223] amdgpu 0003:01:00.0: [drm] Check your /sys/class/drm/card24/device/devcoredump/data [ 668.107441] kthread+0x10d/0x230 [ 668.107451] ? __pfx_worker_thread+0x10/0x10 [ 668.107458] ? __pfx_kthread+0x10/0x10 [ 668.112709] amdgpu 0000:01:00.0: ring vcn_unified_1 timeout, signaled seq=9, emitted seq=10 [ 668.118630] ret_from_fork+0x17c/0x1f0 [ 668.118640] ? __pfx_kthread+0x10/0x10 [ 668.118647] ret_from_fork_asm+0x1a/0x30 Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu/userq: amdgpu_userq_vm_validate does not need userq mutex	Sunil Khatri
	amdgpu_userq_vm_validate function does not need userq_mutex and exec lock is good enough to locking all bos and updating the eviction fence. Also since we only need userq_mutex for amdgpu_userq_restore_all so move the locks in the function itself. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-30	drm/amdgpu: validate doorbell_offset in user queue creation	Junrui Luo
	amdgpu_userq_get_doorbell_index() passes the user-provided doorbell_offset to amdgpu_doorbell_index_on_bar() without bounds checking. An arbitrarily large doorbell_offset can cause the calculated doorbell index to fall outside the allocated doorbell BO, potentially corrupting kernel doorbell space. Validate that doorbell_offset falls within the doorbell BO before computing the BAR index, using u64 arithmetic to prevent overflow. Fixes: f09c1e6077ab ("drm/amdgpu: generate doorbell index for userqueue") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-24	drm/amdgpu: Handle GPU page faults correctly on non-4K page systems	Donet Tom
	During a GPU page fault, the driver restores the SVM range and then maps it into the GPU page tables. The current implementation passes a GPU-page-size (4K-based) PFN to svm_range_restore_pages() to restore the range. SVM ranges are tracked using system-page-size PFNs. On systems where the system page size is larger than 4K, using GPU-page-size PFNs to restore the range causes two problems: Range lookup fails: Because the restore function receives PFNs in GPU (4K) units, the SVM range lookup does not find the existing range. This will result in a duplicate SVM range being created. VMA lookup failure: The restore function also tries to locate the VMA for the faulting address. It converts the GPU-page-size PFN into an address using the system page size, which results in an incorrect address on non-4K page-size systems. As a result, the VMA lookup fails with the message: "address 0xxxx VMA is removed". This patch passes the system-page-size PFN to svm_range_restore_pages() so that the SVM range is restored correctly on non-4K page systems. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 074fe395fb13247b057f60004c7ebcca9f38ef46)
2026-03-24	drm/amdgpu: Fix fence put before wait in amdgpu_amdkfd_submit_ib	Srinivasan Shanmugam
	amdgpu_amdkfd_submit_ib() submits a GPU job and gets a fence from amdgpu_ib_schedule(). This fence is used to wait for job completion. Currently, the code drops the fence reference using dma_fence_put() before calling dma_fence_wait(). If dma_fence_put() releases the last reference, the fence may be freed before dma_fence_wait() is called. This can lead to a use-after-free. Fix this by waiting on the fence first and releasing the reference only after dma_fence_wait() completes. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c:697 amdgpu_amdkfd_submit_ib() warn: passing freed memory 'f' (line 696) Fixes: 9ae55f030dc5 ("drm/amdgpu: Follow up change to previous drm scheduler change.") Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8b9e5259adc385b61a6590a13b82ae0ac2bd3482)
2026-03-24	drm/amdgpu/userq: schedule_delayed_work should be after fence signalled	Sunil Khatri
	Reorganise the amdgpu_eviction_fence_suspend_worker code so schedule_delayed_work is the last thing we do after amdgpu_userq_evict is complete and the eviction fence is signalled. Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-24	drm/amdgpu/mes12_1: emove extra ; from declaration statement	Colin Ian King
	There is a declaration statement that has a ;; at the end, remove the extraneous ; Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-24	drm/amdgpu: update outdated comment for renamed amdgpu_fence_driver_init()	Kexin Sun
	The function amdgpu_fence_driver_init() was renamed to amdgpu_fence_driver_sw_init() by commit 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)"). Update the stale reference in the amdgpu_fence_driver_init_ring() kdoc. Assisted-by: unnamed:deepseek-v3.2 coccinelle Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-24	drm/amdgpu/userq: convert comma to semicolon	Chen Ni
	Using a ',' in place of a ';' can have unintended side effects. Although that is not the case here, it seems best to use ';' unless ',' is intended. Found by inspection. No functional change intended. Compile tested only. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-24	drm/amdgpu: Handle GPU page faults correctly on non-4K page systems	Donet Tom
	During a GPU page fault, the driver restores the SVM range and then maps it into the GPU page tables. The current implementation passes a GPU-page-size (4K-based) PFN to svm_range_restore_pages() to restore the range. SVM ranges are tracked using system-page-size PFNs. On systems where the system page size is larger than 4K, using GPU-page-size PFNs to restore the range causes two problems: Range lookup fails: Because the restore function receives PFNs in GPU (4K) units, the SVM range lookup does not find the existing range. This will result in a duplicate SVM range being created. VMA lookup failure: The restore function also tries to locate the VMA for the faulting address. It converts the GPU-page-size PFN into an address using the system page size, which results in an incorrect address on non-4K page-size systems. As a result, the VMA lookup fails with the message: "address 0xxxx VMA is removed". This patch passes the system-page-size PFN to svm_range_restore_pages() so that the SVM range is restored correctly on non-4K page systems. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-03-24	drm/amdgpu/userq: dont use goto to jump when at end of function	Sunil Khatri
	In function amdgpu_userq_restore_worker we dont need to use goto as we already in the end of function and it will exit naturally. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>