linux.git - Linux kernel source tree

Age	Commit message	Author
6 days	net/mlx5: HWS, fix matcher leak on resize target setup failure	Dawei Feng
	hws_bwc_matcher_move() allocates a replacement matcher before setting it as the resize target. If mlx5hws_matcher_resize_set_target() fails, the replacement matcher is not attached anywhere and is leaked. Fix the leak by destroying the replacement matcher before returning from the resize-target failure path. The bug was first flagged by an experimental analysis tool we are developing for kernel memory-management bugs while analyzing v6.13-rc1. The tool is still under development and is not yet publicly available. Manual inspection confirms that the bug is still present in v7.1.1. An x86_64 allyesconfig build showed no new warnings. As we do not have a mlx5 HWS-capable device to test with, no runtime testing was able to be performed. Fixes: 2111bb970c78 ("net/mlx5: HWS, added backward-compatible API handling") Cc: stable@vger.kernel.org Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260629064049.3852759-1-dawei.feng@seu.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
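The failure-path cleanup described in the commit above can be sketched in plain C. This is a minimal illustration, not the mlx5 HWS code: `struct matcher`, `matcher_create()`, `matcher_destroy()`, `set_resize_target()`, `matcher_move()` and the `live_matchers` counter are all hypothetical stand-ins, and the `fail` flag exists only to inject the error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the mlx5 HWS types or functions. */
struct matcher {
	struct matcher *resize_target;
};

static int live_matchers; /* allocations that have not been freed yet */

static struct matcher *matcher_create(void)
{
	struct matcher *m = calloc(1, sizeof(*m));

	if (m)
		live_matchers++;
	return m;
}

static void matcher_destroy(struct matcher *m)
{
	if (!m)
		return;
	free(m);
	live_matchers--;
}

/* Attaches new_m to old_m, or fails without taking ownership of new_m. */
static int set_resize_target(struct matcher *old_m, struct matcher *new_m,
			     bool fail)
{
	if (fail)
		return -1;
	old_m->resize_target = new_m;
	return 0;
}

static int matcher_move(struct matcher *old_m, bool inject_failure)
{
	struct matcher *new_m;
	int ret;

	assert(old_m != NULL);

	new_m = matcher_create();
	if (!new_m)
		return -1;

	ret = set_resize_target(old_m, new_m, inject_failure);
	if (ret) {
		/* new_m is not attached anywhere yet, so it must be freed here. */
		matcher_destroy(new_m);
		return ret;
	}
	return 0;
}
```

Without the `matcher_destroy(new_m)` call on the error path, `live_matchers` would grow by one on every failed move, which is the shape of the leak the commit describes.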
6 days	accel/amdxdna: Fix use-after-free in debug BO command handling	Lizhi Hou
	When a debug BO command completes, job->drv_cmd may already have been freed. Accessing it from aie2_sched_drvcmd_resp_handler() can result in a use-after-free and memory corruption. Fix this by introducing reference counting for drv_cmd objects and transferring ownership to the job while it is in flight. This ensures that the command remains valid until the completion handler finishes processing it. Fixes: 7ea046838021 ("accel/amdxdna: Support firmware debug buffer") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260701155556.663541-1-lizhi.hou@amd.com
6 days	drm/xe/rtp: Add struct types for RTP tables	Gustavo Sousa
	We currently have a mixture of styles for our RTP tables with respect of how we define the number of entries: * xe_rtp_process_to_sr() expects to receive the number of entries as arguments; * xe_rtp_process() expects the array to have a sentinel at the end of the array; * in xe_rtp_test.c, even though xe_rtp_process_to_sr() does not require a sentinel value, we need to rely on that technique to be able to count xe_rtp_entry_sr entries because simply using ARRAY_SIZE() is not possible. The style used by xe_rtp_process_to_sr() makes it hard to share the tables with other compilation units (e.g. kunit tests), since the number of entries is calculated with ARRAY_SIZE(), which is done at compile time. Since we use the size of the tables to create some bitmasks, using a sentinel style doesn't seem great either. A way to reconcile things into a single style is to have a struct type that would hold the entries array and the number of entries. Since we have xe_rtp_entry and xe_rtp_entry_sr, we would have one type for each. The advantage of the proposed approach is that now we have a nice way to share the tables directly to kunit tests with information about their size. v6: - Removed sentinels that are not needed v5: - Removed added code from conflict resolution issues v4: - Removed conflicts with main branch v3: - No changes v2: - Add compatibility with new xe_rtp_table_sr format for "bad-mcr-reg-forced-to-regular" and "bad-regular-reg-forced-to-mcr" Fixes: 828a8eaf37c3 ("drm/xe/oa: Add MMIO trigger support") Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by: Violet Monti <violet.monti@intel.com> Link: https://patch.msgid.link/20260601200947.2032784-7-violet.monti@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit 5ff004fdc7377905f2fe5264b8829d35e14608b8) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
6 days	drm/amdgpu/jpeg: fix jpeg_v4_0_3_is_idle detection	Boyuan Zhang
	jpeg_v4_0_3_is_idle() initializes ret to false and then accumulates ring idle status using &=. Since false & condition always remains false, the function can never report the JPEG block as idle. Initialize ret to true so the function returns true only when all JPEG rings report RB_JOB_DONE. Signed-off-by: Boyuan Zhang <boyuan.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e9df8e9d04e0593d17ddb069f3b7958991cd18c9) Cc: stable@vger.kernel.org
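The accumulation bug described in the commit above can be reproduced outside the driver. A minimal sketch, assuming a hypothetical array of per-ring "job done" flags in place of the real RB_JOB_DONE register reads (the function names are illustrative, not driver symbols):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Buggy pattern: starting from false, "&=" can never produce true. */
static bool all_idle_buggy(const bool *ring_done, size_t n)
{
	bool ret = false;

	assert(ring_done != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		ret &= ring_done[i];
	return ret;
}

/* Fixed pattern: start from true so the result is the AND of all rings. */
static bool all_idle_fixed(const bool *ring_done, size_t n)
{
	bool ret = true;

	assert(ring_done != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		ret &= ring_done[i];
	return ret;
}
```

With every flag set, `all_idle_buggy()` still returns false, while `all_idle_fixed()` returns true only when no ring is busy.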
6 days	drm/amdgpu: Fix kernel panic during driver load failure	Harish Kasiviswanathan
	Avoid kernel panic if MES init fails during driver load. The KIQ ring is falsely marked as ready even though, on ASICs that use MES, the KIQ is owned by MES. BUG: kernel NULL pointer dereference, address: 0000000000000000 RIP: 0010:gfx_v12_1_wait_reg_mem+0x5a/0x1f0 [amdgpu] Call Trace: gfx_v12_1_ring_emit_reg_write_reg_wait+0x1f/0x30 [amdgpu] amdgpu_gmc_fw_reg_write_reg_wait+0xb2/0x190 [amdgpu] amdgpu_gmc_flush_gpu_tlb+0x1cc/0x230 [amdgpu] amdgpu_gart_invalidate_tlb+0x81/0xa0 [amdgpu] amdgpu_gart_unbind+0x72/0x90 [amdgpu] amdgpu_ttm_backend_unbind+0xa4/0xb0 [amdgpu] amdgpu_ttm_tt_unpopulate+0x13/0xd0 [amdgpu] amdttm_tt_unpopulate+0x29/0x70 [amdttm] ttm_bo_put+0x1eb/0x360 [amdttm] amdgpu_bo_free_kernel+0xf9/0x1f0 [amdgpu] amdgpu_ih_ring_fini+0x5a/0x90 [amdgpu] amdgpu_irq_fini_hw+0x58/0x80 [amdgpu] amdgpu_device_fini_hw+0x4e0/0x5b0 [amdgpu] amdgpu_driver_load_kms+0x60/0xa0 [amdgpu] amdgpu_pci_probe+0x28e/0x6d0 [amdgpu] pci_device_probe+0x19f/0x220 really_probe+0x1ed/0x340 driver_probe_device+0x1e/0x80 __driver_attach+0xd3/0x1a0 bus_for_each_dev+0x68/0xa0 bus_add_driver+0x19f/0x270 driver_register+0x5d/0xf0 do_one_initcall+0xac/0x200 do_init_module+0x1ec/0x280 __se_sys_finit_module+0x2de/0x310 do_syscall_64+0x6a/0x250 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 4623b958dd6da0f4c3026afdf330626a09ecb0f0) Cc: stable@vger.kernel.org
6 days	drm/amd/display: detect_link_and_local_sink: DP alt mode timeout path leaks prev_sink reference	WenTao Liang
	prev_sink is unconditionally retained via dc_sink_retain at function entry, but the DP alt mode timeout path inside SIGNAL_TYPE_DISPLAY_PORT returns false without releasing prev_sink. All other return paths in the function correctly call dc_sink_release(prev_sink), making this the only missing cleanup. Fixes: 54618888d1ea ("drm/amd/display: break down dc_link.c") Signed-off-by: WenTao Liang <vulab@iscas.ac.cn> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Link: https://patch.msgid.link/20260626124555.36910-1-vulab@iscas.ac.cn Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 45510cf662dcf46b5d8926d454f338809f107b9d) Cc: stable@vger.kernel.org
6 days	drm/amd/pm: fix smu13 power limit range calculation	Yang Wang
	SMU13 reports SocketPowerLimitAc/Dc as the default power limit, but MsgLimits.Power may carry a different firmware bound for the same PPT throttler. Using only the socket limit for both min and max can therefore expose an incorrect power range. Keep the socket limit as the default, but derive the range from both values: use the lower value for the min base and the higher value for the max base before applying OD percentages. Keep the current limit query independent from the cap calculation. Fixes: 1eaf26db9590 ("drm/amd/pm: fix smu13 power limit default/cap calculation") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5419 Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit f45bbf0f62f266ed8422d84f347d75d5fca846a7) Cc: stable@vger.kernel.org
6 days	drm/amdgpu: flush pending RCU callbacks on module unload	Perry Yuan
	Call rcu_barrier() in module exit to wait for outstanding call_rcu() callbacks before freeing module text, preventing late callback execution in freed memory. BUG: unable to handle page fault for address: ffffffffc1d59c40 PGD 6a12067 P4D 6a12067 PUD 6a14067 PMD 13698b067 PTE 0 Oops: 0010 [#1] SMP NOPTI RIP: 0010:0xffffffffc1d59c40 Code: Unable to access opcode bytes at RIP 0xffffffffc1d59c16. RSP: 0018:ffffc900198c0f28 EFLAGS: 00010286 RAX: ffffffffc1d59c40 RBX: ffff897c7d6b61c0 RCX: ffff88826aff4590 RDX: ffff8884d8b35490 RSI: ffffc900198c0f30 RDI: ffff88812af67290 RBP: 000000000000000a (DONE segment entries) R08: 0000000000000000 R09: 0000000000000100 R10: 0000000000000000 R11: ffffffff82a06100 R12: ffff88811a4e3700 R13: 0000000000000000 R14: ffff897c7d6b6270 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff897c7d680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffc1d59c16 CR3: 00000104a980a001 CR4: 0000000002770ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? rcu_do_batch+0x163/0x450 ? rcu_core+0x177/0x1c0 ? __do_softirq+0xc1/0x280 ? asm_call_irq_on_stack+0xf/0x20 </IRQ> ? do_softirq_own_stack+0x37/0x50 ? irq_exit_rcu+0xc4/0x100 ? sysvec_apic_timer_interrupt+0x36/0x80 ? asm_sysvec_apic_timer_interrupt+0x12/0x20 ? cpuidle_enter_state+0xd4/0x360 ? cpuidle_enter+0x29/0x40 ? cpuidle_idle_call+0x108/0x1a0 ? do_idle+0x77/0xf0 ? cpu_startup_entry+0x19/0x20 ? secondary_startup_64_no_verify+0xbf/0xcb Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit feaa5039f6c12acc9aa934c2d45dcd251a12c69f)
6 days	drm/amdgpu: Fix AMDGPU_GTT_MAX_TRANSFER_SIZE for non-4K systems	Donet Tom
	Running RCCL unit tests on a system with a 64K PAGE_SIZE triggers the following warning and causes the test to terminate on latest upstream kernel: WARNING: drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1335 at amdgpu_bo_release_notify+0x1bc/0x280 [amdgpu], CPU#18: rccl-UnitTests/33151 Call trace: amdgpu_bo_release_notify ttm_bo_release amdgpu_gem_object_free drm_gem_object_free amdgpu_bo_unref amdgpu_bo_create amdgpu_bo_create_user amdgpu_gem_object_create amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu kfd_ioctl_alloc_memory_of_gpu kfd_ioctl sys_ioctl The warning is triggered because amdgpu_ttm_next_clear_entity() returns NULL when a clear buffer operation is requested. This happens because the GART window allocation for the default_entity, clear_entity and move_entity fails during initialization. Commit [1] introduced separate GART windows for the default_entity, clear_entity and move_entity of each SDMA instance. Their sizes are derived from AMDGPU_GTT_MAX_TRANSFER_SIZE, which is currently defined as 1024 pages. This implicitly assumes a 4K PAGE_SIZE, where 1024 pages correspond to a 4MB transfer. On a 64K PAGE_SIZE system, however, the same value expands to 64MB. The default_entity and clear_entity each allocate one AMDGPU_GTT_MAX_TRANSFER_SIZE GART window, while the move_entity allocates two such windows. This results in 16MB of GART space per SDMA instance on a 4K PAGE_SIZE system, but 256MB per SDMA instance on a 64K PAGE_SIZE system. On an MI210 system with five SDMA instances and a 512MB GART aperture, the total GART space required becomes 1.25GB, exceeding the available GART aperture. Consequently, GART window allocation fails, amdgpu_ttm_next_clear_entity() returns NULL, and the above warning is triggered. Redefine AMDGPU_GTT_MAX_TRANSFER_SIZE in bytes instead of page units. Where a page count is required, convert it using PAGE_SHIFT. 
This preserves the existing 4MB transfer size across all PAGE_SIZE configurations while keeping GART window allocations within the available GART aperture. [1] https://lore.kernel.org/all/20260408100327.1372-3-pierre-eric.pelloux-prayer@amd.com/#t Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5435 Fixes: 897ee11ec020 ("drm/amdgpu: create multiple clear/move ttm entities") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 27213b776a666d3030de5acc3cd75278197b0494) Cc: stable@vger.kernel.org
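The GART sizing arithmetic in the commit above can be checked with a few lines of C. This is a sketch of the numbers quoted in the message only: the four-windows-per-instance count (one default, one clear, two move) and the five SDMA instances come from the commit text, while the function and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* default (1) + clear (1) + move (2) windows, as described in the commit text */
#define WINDOWS_PER_SDMA 4u

/* Old definition: 1024 pages, so the byte size scales with the page size. */
static uint64_t gart_bytes_old(uint64_t page_size, unsigned int sdma_instances)
{
	uint64_t window;

	assert(page_size > 0);
	window = 1024u * page_size;
	return window * WINDOWS_PER_SDMA * sdma_instances;
}

/* New definition: a fixed 4 MiB window, independent of the page size. */
static uint64_t gart_bytes_new(unsigned int sdma_instances)
{
	uint64_t window = 4ull << 20;

	return window * WINDOWS_PER_SDMA * sdma_instances;
}
```

For five SDMA instances the old definition needs 80 MiB with 4K pages but 1280 MiB (1.25 GiB) with 64K pages, which no longer fits a 512 MiB aperture; the byte-based definition stays at 80 MiB in both cases.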
6 days	drm/amdkfd: Use kvcalloc to allocate arrays	David Francis
	There were a few instances in kfd_chardev.c of kvzalloc being used to allocate memory for an array. Switch those to kvcalloc, which - is the standard way of allocating a zero-initialized array - does a check for the mul overflowing Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 60b048c93f7a3add39757ad65fe2bb6e58eeae23) Cc: stable@vger.kernel.org
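The multiplication check that the commit above attributes to kvcalloc can be illustrated in user space. This is only an analogy under stated assumptions: `array_bytes()` and `zalloc_array()` are hypothetical helpers built on `malloc()`, not the kernel allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Computes n * size, reporting failure instead of wrapping around. */
static bool array_bytes(size_t n, size_t size, size_t *bytes)
{
	assert(bytes != NULL);
	if (size != 0 && n > SIZE_MAX / size)
		return false;
	*bytes = n * size;
	return true;
}

/* Zero-initialized array allocation that refuses overflowing requests. */
static void *zalloc_array(size_t n, size_t size)
{
	size_t bytes;
	void *p;

	if (!array_bytes(n, size, &bytes))
		return NULL;
	p = malloc(bytes ? bytes : 1);
	if (p)
		memset(p, 0, bytes);
	return p;
}
```

An open-coded `n * size` passed to a single-size allocator would silently wrap for large `n`, returning a buffer far smaller than the caller expects; the array form rejects such a request up front.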
6 days	drm/amdgpu: add support for GC IP version 11.7.1	Granthali Vinodkumar Dhandar
	Initialize GC IP 11_7_1 Signed-off-by: Granthali Vinodkumar Dhandar <granthali.vinodkumardhandar@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a928d8d81ec5cdb5a8944d08136720811efad0f6)
6 days	drm/amdgpu: add support for GC IP version 11.7.0	Granthali Vinodkumar Dhandar
	Initialize GC IP 11_7_0 Signed-off-by: Granthali Vinodkumar Dhandar <granthali.vinodkumardhandar@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit cf591e67c095542a16475df293ec7bc9a118e4ee)
6 days	drm/amdgpu: add the doorbell index input for suspending userq	Prike Liang
	The MES firmware needs the doorbell offset as input to preempt the userq, and adding the doorbell offset also keeps the driver aligned with the union MESAPI__SUSPEND in the MES firmware. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit bc434335ab3c096a33a9e88c7951b4ac574db458) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/mes12: set doorbell offset for suspending userq	Prike Liang
	Updating the union MESAPI__SUSPEND and union MESAPI__RESUME to add the doorbell offset for suspending userq. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5b58a2c120063544869d0284d3b355527f9f04f5) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/mes11: set doorbell offset for suspending userq	Prike Liang
	Updating the union MESAPI__SUSPEND and union MESAPI__RESUME to add the doorbell offset for suspending userq. Signed-off-by: Prike Liang <Prike.Liang@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 30af09db33696f7e0de5c0c505cbb0cb92b6e25b) Cc: stable@vger.kernel.org
6 days	drm/amdgpu: fix check in amdgpu_hmm_invalidate_gfx	Christian König
	For a short moment during alloc/free the userptr BO is not part of its VM, so bo->vm_bo can be NULL. Keep a reference to the VM root PD as parent of the userptr BO so that we can always use that to wait for all submissions of the VM instead of only the one involving the userptr BO. Signed-off-by: Christian König <christian.koenig@amd.com> Fixes: 91250893cbaa ("drm/amdgpu: fix waiting for all submissions for userptrs") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5399 Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 631849ff5d603841e74f19f4a5e30fe1f7d7cf30) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/jpeg: fix jpeg_v5_0_1_is_idle detection	Boyuan Zhang
	jpeg_v5_0_1_is_idle() initializes ret to false and then accumulates ring idle status using &=. Since false & condition always remains false, the function can never report the JPEG block as idle. Initialize ret to true so the function returns true only when all JPEG rings report RB_JOB_DONE. Signed-off-by: Boyuan Zhang <boyuan.zhang@amd.com> Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 680adf5faeeabb4585f7aeb53681719e2d6c2f41) Cc: stable@vger.kernel.org
6 days	drm/amdgpu: Rename moved state to needs_update	Natalie Vock
	This state can be reached via other means than physical moves, like PRT bindings. Make the name match the actual purpose of the state. Signed-off-by: Natalie Vock <natalie.vock@gmx.de> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 1f7a795fb9f8186bd81ca9c4a80f75482db53c9e)
6 days	drm/amdgpu: Only set bo->moved when the BO was actually moved	Natalie Vock
	The "moved" VM state is a bit unfortunately named, because BOs can end up in this state without being physically moved. While we need to invalidate every mapping when BOs are physically moved, in some other cases like PRT binds/unbinds there is no need to refresh mappings except those affected by the bind. Full invalidation of all BO mappings manifested as severe regressions in PRT bind performance, which this patch fixes. The offending patch is 4cdbba5a16aa ("drm/amdgpu: restructure VM state machine v4") in the amd-staging-drm-next tree, although it has not yet propagated anywhere else. Fixes: 4cdbba5a16aa ("drm/amdgpu: restructure VM state machine v4") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5437 Signed-off-by: Natalie Vock <natalie.vock@gmx.de> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 0b2fa33b4235991a100dd799c891cf5c242aaed1) Cc: stable@vger.kernel.org
6 days	drm/amd/display: guard against overflow in HDCP message dump	Harry Wentland
	[Why] mod_hdcp_dump_binary_message() computed target_size (a uint32_t) as roughly byte_size * msg_size and gated the whole write on buf_size >= target_size. A large msg_size can overflow target_size, wrapping it to a small value that passes the check while the loop still writes byte_size * msg_size bytes into buf. All current callers pass small constants so this is not reachable today, but the unchecked arithmetic should be hardened. [How] Drop the overflow-prone target_size precomputation and instead bounds-check the output position on every iteration, stopping once the next entry would not leave room for the trailing terminator. This cannot overflow and, for oversized messages, dumps as much as fits rather than printing nothing. Fixes: 4c283fdac08a ("drm/amd/display: Add HDCP module") Assisted-by: Copilot:claude-opus-4.8 Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: George Zhang <george.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit d0a775e5d70b376696245a14c09e3aa6dde0023a) Cc: stable@vger.kernel.org
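The per-iteration bounds check described under [How] in the commit above can be sketched as follows. The output format here (two hex digits and a space per byte) and the name `dump_hex_bounded()` are assumptions for illustration; the real mod_hdcp_dump_binary_message() layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed entry format: two hex digits plus one separating space. */
#define ENTRY_SIZE 3u

/* Writes as many entries as fit, always leaving room for the terminator. */
static size_t dump_hex_bounded(const uint8_t *msg, size_t msg_size,
			       char *buf, size_t buf_size)
{
	static const char digits[] = "0123456789ABCDEF";
	size_t pos = 0;

	assert(msg != NULL && buf != NULL);
	if (buf_size == 0)
		return 0;

	for (size_t i = 0; i < msg_size; i++) {
		/* Stop if this entry plus the trailing '\0' would not fit. */
		if (buf_size - pos < ENTRY_SIZE + 1)
			break;
		buf[pos++] = digits[msg[i] >> 4];
		buf[pos++] = digits[msg[i] & 0x0f];
		buf[pos++] = ' ';
	}
	buf[pos] = '\0';
	return pos;
}
```

Because `pos` never exceeds `buf_size - 1`, the subtraction `buf_size - pos` cannot wrap, and no product of `msg_size` and the entry size is ever computed. An oversized message is truncated rather than dropped.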
6 days	drm/amd/display: use kvzalloc to allocate struct dc	Honglei Huang
	struct dc has grown large over time (most of it the two inlined dc_scratch_space copies) and now sits close to the page allocator's 4 MiB contiguous allocation limit. Its actual size is not fixed by the source alone, it also depends on the compiler and the .config, so it can easily cross 4 MiB, e.g. with a newer GCC or a config change. dc_create() allocates it with kzalloc(). Once struct dc exceeds 4 MiB the request is rounded up to order 11 (8 MiB), which is above MAX_PAGE_ORDER, so the page allocator warns and returns NULL. dc_create() then fails, DM init fails and amdgpu probe aborts with -EINVAL: WARNING: mm/page_alloc.c:5197 at __alloc_frozen_pages_noprof+0x2f9/0x380 dc_create+0x38/0x660 [amdgpu] amdgpu_dm_init+0x2d9/0x510 [amdgpu] dm_hw_init+0x1b/0x90 [amdgpu] amdgpu_device_init.cold+0x150d/0x1e13 [amdgpu] amdgpu_driver_load_kms+0x19/0x80 [amdgpu] amdgpu_pci_probe+0x1e2/0x4c0 [amdgpu] dc_create() then returns NULL and DM init fails, which aborts the whole GPU init and makes amdgpu probe fail with -EINVAL ("hw_init of IP block <dm> failed -22"), leaving the display unusable. The subsequent amdgpu_irq_put() warnings during teardown are just fallout of unwinding a half-initialized device. struct dc is a software-only bookkeeping structure that is never handed to hardware DMA and is only ever kept as an opaque pointer, so it does not require physically contiguous memory. Allocate it with kvzalloc() (and free it with kvfree()) so that the allocator can fall back to vmalloc() when a contiguous allocation of that size is not available, which also avoids the MAX_PAGE_ORDER warning entirely. v2: - Rebase to amd-staging-drm-next. Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5406 Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Honglei Huang <honghuan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 991e0516a8072f2292681c6ae98a924ab0e32575) Cc: stable@vger.kernel.org
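The order arithmetic behind the warning in the commit above can be worked through in a few lines. This sketch assumes 4 KiB pages and a MAX_PAGE_ORDER of 10 (a 4 MiB limit, matching the figure in the commit text); both constants and the helper names are assumptions, and other architectures or configurations use different values.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096ull     /* assumption: 4 KiB pages */
#define MAX_ORDER_ASSUMED 10u  /* assumption: MAX_PAGE_ORDER == 10 */

/* Smallest order such that (PAGE_BYTES << order) >= size. */
static unsigned int order_for_size(uint64_t size)
{
	unsigned int order = 0;

	assert(size > 0 && size <= (PAGE_BYTES << 40));
	while ((PAGE_BYTES << order) < size)
		order++;
	return order;
}

/* Non-zero if a physically contiguous allocation of this size is possible. */
static int fits_page_allocator(uint64_t size)
{
	return order_for_size(size) <= MAX_ORDER_ASSUMED;
}
```

A structure of exactly 4 MiB maps to order 10 and still fits, but a single extra byte pushes the request to order 11 (8 MiB), which is why a small growth in struct dc is enough to make a kzalloc()-style allocation fail outright.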
6 days	drm/amdgpu: invoke pm_genpd_remove() before freeing genpd	Ce Sun
	Call pm_genpd_remove() to unregister from global list prior to releasing acp_genpd memory, and clear the pointer after free. Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit cd8650d7a91ee8b768e202354672553faa5cc1f2) Cc: stable@vger.kernel.org
6 days	drm/amdgpu: fix resource leak on ACP reset timeout	Ce Sun
	When ACP soft reset poll times out, original code returns early without cleanup, leaking MFD child devices, genpd links and all ACP heap allocations. Replace direct early return with goto out to force run all cleanup logic regardless of reset success, preserve timeout error code for caller. Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 98073e4328d7a8d75d03696ab27f6de70ef1aeda) Cc: stable@vger.kernel.org
6 days	drm/amdgpu: reject mapping a reserved doorbell to a new queue	Zhu Lingshan
	When creating a user queue, user space provides a doorbell BO handle and an offset within the BO to obtain a doorbell. However, the current implementation uses xa_store_irq() to store a doorbell, which allows a later queue created with the same BO and offset parameters to overwrite an existing queue and doorbell mapping. This can cause problems such as misrouting fence IRQ processing to the wrong queue, and can mislead the cleanup of one queue into erasing the mapping of another queue. This commit fixes the issue by replacing xa_store_irq() with xa_insert_irq(), which rejects mapping a reserved doorbell to a newly created queue. Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6244eae22966350db52faf9c1369d3b2ffc5de4e) Cc: stable@vger.kernel.org
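The difference between store and insert semantics in the commit above can be shown with a user-space analogy. This is not the XArray API: `struct doorbell_table`, `doorbell_store()` and `doorbell_insert()` are hypothetical, and a fixed array stands in for the xarray. The one property borrowed from the kernel is that an insert into an occupied slot fails with -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NUM_DOORBELLS 8u

/* Hypothetical doorbell-index to queue mapping. */
struct doorbell_table {
	void *queue[NUM_DOORBELLS];
};

/* Store semantics: silently replaces whatever was mapped before. */
static int doorbell_store(struct doorbell_table *t, size_t index, void *queue)
{
	assert(t != NULL);
	if (index >= NUM_DOORBELLS)
		return -EINVAL;
	t->queue[index] = queue;
	return 0;
}

/* Insert semantics: refuses to touch a slot that is already in use. */
static int doorbell_insert(struct doorbell_table *t, size_t index, void *queue)
{
	assert(t != NULL);
	if (index >= NUM_DOORBELLS)
		return -EINVAL;
	if (t->queue[index])
		return -EBUSY;
	t->queue[index] = queue;
	return 0;
}
```

With insert semantics, the second queue that asks for an occupied doorbell gets an error at creation time, so the first queue's mapping stays intact for both IRQ routing and cleanup.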
6 days	drm/amd/display: Handle struct drm_plane_state.ignore_damage_clips	Thomas Zimmermann
	The mode-setting pipeline can disable damage clipping for a commit by setting ignore_damage_clips in struct drm_plane_state. The commit will then do a full display update. Test the flag in DCN code and do a full update if it has been set. Commit 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips") introduced ignore_damage_clips to selectively ignore damage clipping in certain framebuffer changes. This driver does not do that, but DRM's damage iterator will soon rely on the flag. Therefore supporting it here as well makes sense for consistency. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Fixes: 35ed38d58257 ("drm: Allow drivers to indicate the damage helpers to ignore damage clips") Cc: Javier Martinez Canillas <javierm@redhat.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Zack Rusin <zackr@vmware.com> Cc: dri-devel@lists.freedesktop.org Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a24019f6480fad5c077b5956eed942c8960323d6) Cc: <stable@vger.kernel.org> # v6.8+
6 days	drm/amdgpu/gfx12: fix EOP interrupt routing for KQ and userq	Jesse Zhang
	Try KQ by ring_id first (KCQ and UQ never share a HW slot); fall back to amdgpu_userq_process_fence_irq() on miss, since KCQ EOPs were misrouted into the userq fence path when enable_mes is true. Require a strict (me,pipe,queue) match in the gfx case, then userq gfx EOPs fall through to amdgpu_userq_process_fence_irq(). Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6c1f4f7ff08448e0e18cd7fc4e59d6c96a36f25d) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/gfx11: fix EOP interrupt routing for KQ and userq	Jesse Zhang
	Try KQ by ring_id first (KCQ and UQ never share a HW slot); fall back to amdgpu_userq_process_fence_irq() on miss, since KQ EOPs were misrouted into the userq fence path when enable_mes is true. Require a strict (me,pipe,queue) match in the gfx case, then userq gfx EOPs fall through to amdgpu_userq_process_fence_irq(). Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 88e589cc811ba907209a426c426c469bcb4bb894) Cc: stable@vger.kernel.org
6 days	drm/amdkfd: clamp v9 CRIU control stack checkpoint copy to BO size	Yongqiang Sun
	CRIU checkpoint copies the MQD control stack using cp_hqd_cntl_stack_size from hardware without bounding it to the allocated BO region. If the HW field is larger than the queue's control stack allocation, memcpy reads past the BO into adjacent GTT memory and can leak kernel data to userspace. Store the page-aligned control stack BO size in mqd_manager and clamp checkpoint copies and reported checkpoint sizes to min(cp_hqd_cntl_stack_size, mm->ctl_stack_size). Apply the same bound for multi-XCC v9.4.3 checkpoint layout. Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com> Reviewed-by: David Francis <David.Francis@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6c2abd0ec09e86c6323010673766f76050e28aa3) Cc: stable@vger.kernel.org
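The clamp described in the commit above amounts to taking the minimum of the hardware-reported size and the allocation size before copying. A minimal sketch, with `checkpoint_copy()` and its parameters as hypothetical names rather than the amdkfd functions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies at most bo_size bytes, even if the hardware-reported size is
 * larger, and returns the number of bytes actually copied.
 */
static size_t checkpoint_copy(void *dst, const void *ctl_stack,
			      size_t hw_size, size_t bo_size)
{
	size_t n = hw_size < bo_size ? hw_size : bo_size;

	assert(dst != NULL && ctl_stack != NULL);
	memcpy(dst, ctl_stack, n);
	return n;
}
```

Returning the clamped length matters as much as clamping the copy: the size reported to user space should match what was copied, which is why the commit bounds both the copies and the reported checkpoint sizes.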
6 days	drm/amdgpu: fix aperture mapping leak	Asad Kamal
	amdgpu_pci_remove() calls drm_dev_unplug() before invoking the driver fini routines. This causes drm_dev_enter() in amdgpu_ttm_fini() to always return false, so iounmap(aper_base_kaddr) never runs on normal driver unload, leaving an orphaned entry in the x86 PAT interval tree. On connected_to_cpu hardware, the aperture is mapped write-back (WB) via ioremap_cache(). On reload, IP discovery calls memremap(..., MEMREMAP_WC) over the same range. The WC vs WB conflict causes: ioremap error for 0x..., requested 0x1, got 0x0 amdgpu: discovery failed: -2 Fix by switching to devres-managed mappings so cleanup is guaranteed regardless of drm_dev_enter() state: - connected_to_cpu path: devm_memremap(MEMREMAP_WB). For IORESOURCE_SYSTEM_RAM ranges this takes the try_ram_remap() shortcut, returning __va(offset) from the existing kernel direct map. No new ioremap VA or PAT entry is created, so there is nothing to orphan. - dGPU path: devm_ioremap_wc() registers iounmap() as a devres action, guaranteeing cleanup at device_del() time. Also remove iounmap(aper_base_kaddr) from amdgpu_device_unmap_mmio() since the mapping is now devres-owned. v2: Remove redundant x86_64 guard (Lijo) Fixes: 9d0af8b4def0 ("drm/amdgpu: pre-map device buffer as cached for A+A config") Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit d871e99879cb5fd1fa798b006b4888887e63a17a) Cc: stable@vger.kernel.org
6 days	drm/amd/display: avoid large stack allocation in commit_planes_do_stream_update_sequence	Arnd Bergmann
	The function has two arrays on the stack to hold temporary dsc_optc_config and dsc_config objects. The combination blows through common stack frame warning limits in combination with the other local variables: drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:4070:22: error: stack frame size (1352) exceeds limit (1280) in 'commit_planes_do_stream_update_sequence' [-Werror,-Wframe-larger-than] Since neither array is initialized or used outside of the add_link_update_dsc_config_sequence() function, there is no actual need to keep each element around. Replace the arrays with a single instance each to reduce the stack usage to less than half. Fixes: 9f49d3cd7e71 ("drm/amd/display: Implement block sequencing infrastructure for modular hardware operations.") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Tested-by: Dan Wheeler <daniel.wheeler@amd.com> Acked-by: George Zhang <george.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 9e0896fa6f7dbe9ca3dbbd3b593fa91670f4820b) Cc: stable@vger.kernel.org
6 days	drm/amd/display: Remove DCCG registers not needed in DCN42	Matthew Stewart
	[why] Some resources that exist in the DCN block are not needed and shouldn't be used. [how] Remove defines from register lists. Reviewed-by: Ovidiu (Ovi) Bunea <ovidiu.bunea@amd.com> Signed-off-by: Matthew Stewart <Matthew.Stewart2@amd.com> Signed-off-by: George Zhang <george.zhang@amd.com> Tested-by: Dan Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit dac8aa629a45e34027444f74d3b86b6f104b024c)
6 days	drm/amd/display: Fix DCN42 null registers & register masks	Matthew Stewart
	[why] The register lists used on DCN42 variants are different. Some reused codepaths are trying to access registers not used. [how] Add DISPCLK_FREQ_CHANGECNTL, HUBPREQ_DEBUG, and HDMISTREAMCLK_CNTL to the register lists. Reviewed-by: Ovidiu (Ovi) Bunea <ovidiu.bunea@amd.com> Signed-off-by: Matthew Stewart <Matthew.Stewart2@amd.com> Signed-off-by: George Zhang <george.zhang@amd.com> Tested-by: Dan Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 64142f9d51aff32f4130d916cb8f044a072ad27d)
6 days	drm/amdkfd: Guard m->cp_hqd_eop_control setting by q->eop_ring_buffer_size	Xiaogang Chen
	To avoid wraparound if the value is 0. Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c0cae35661868af207077a4306bc42c7c972947c) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/vce: fix integer overflow in image size	Boyuan Zhang
	Fix a security vulnerability where malicious VCE command streams with oversized dimensions (e.g. 65536×65536) cause 32-bit integer overflow, wrapping the calculated buffer size to 0. This bypasses validation and allows GPU firmware to perform out-of-bound memory access. The fix uses 64-bit arithmetic to detect overflow and rejects invalid dimensions before they reach the hardware. V2: remove redundant check V3: modify max height value V4: remove size64 Signed-off-by: Boyuan Zhang <boyuan.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit cbe408dba581755ad1279a487ec786d8927d778d) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/vcn4: avoid rereading IB param length	Boyuan Zhang
	Reuse the parameter length returned by vcn_v4_0_enc_find_ib_param() instead of rereading it from the IB. This avoids a potential TOCTOU issue if the IB contents change between reads. Signed-off-by: Boyuan Zhang <boyuan.zhang@amd.com> Reviewed-by: David Rosca <david.rosca@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit dbb02b4755f8c1f3773263f2d779872c1c0c073a) Cc: stable@vger.kernel.org
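	The read-once pattern behind this fix can be sketched as follows. The buffer layout (length in bytes, then parameter ID, then payload) and all names here are assumptions for illustration; this is not the signature or behavior of vcn_v4_0_enc_find_ib_param().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout: [length in bytes][param id][payload ...]. */
struct ib_param {
	size_t offset_dw;   /* index of the length dword */
	uint32_t len_bytes; /* length exactly as read during validation */
};

/*
 * Walk the buffer, reading each length once. The validated length is
 * returned to the caller so it never has to be fetched from the
 * (possibly concurrently modified) buffer a second time.
 */
static bool find_ib_param(const volatile uint32_t *ib, size_t ib_dw,
			  uint32_t wanted_id, struct ib_param *out)
{
	size_t idx = 0;

	assert(ib != NULL && out != NULL);
	while (idx + 2 <= ib_dw) {
		uint32_t len = ib[idx];    /* single read of the length */
		uint32_t id = ib[idx + 1];

		if (len < 8 || len % 4 != 0 || len / 4 > ib_dw - idx)
			return false;
		if (id == wanted_id) {
			out->offset_dw = idx;
			out->len_bytes = len;
			return true;
		}
		idx += len / 4;
	}
	return false;
}

/* Callers derive sizes from the cached length, not from the buffer. */
static size_t ib_param_payload_dw(const struct ib_param *p)
{
	return p->len_bytes / 4 - 2;
}
```

	Because the caller only ever sees the length that passed the bounds check, a later change to the buffer cannot make the checked value and the used value diverge.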
6 days	drm/amdgpu: fix division by zero with invalid uvd dimensions	Boyuan Zhang
	When width or height is less than 16, width_in_mb or height_in_mb becomes 0, leading to fs_in_mb being 0. This causes a division by zero when calculating num_dpb_buffer in H264 and H264 Perf decode paths. Add validation to reject frames with width < 16 or height < 16 before performing any calculations that depend on these values. V2: Format change - move up all variable definitions. V3: Use warn_once to avoid spam. Signed-off-by: Boyuan Zhang <boyuan.zhang@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 3e41d26c70b0a459d041cc19482a226c4b7423cb) Cc: stable@vger.kernel.org
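	The failure mode is plain integer division: any dimension below 16 yields zero macroblocks, and the zero then becomes a divisor. A minimal sketch of the guard, assuming 16-pixel macroblocks as the message implies; the function names and the budget_in_mb parameter are hypothetical, and the driver's actual rounding and per-level constants are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MB_SIZE 16u /* macroblock edge in pixels */

/* Reject dimensions that would round down to zero macroblocks. */
static bool frame_size_in_mb(uint32_t width, uint32_t height,
			     uint32_t *fs_in_mb)
{
	assert(fs_in_mb != NULL);
	if (width < MB_SIZE || height < MB_SIZE)
		return false;

	*fs_in_mb = (width / MB_SIZE) * (height / MB_SIZE);
	return true;
}

/*
 * Hypothetical buffer-count calculation: the division only runs after
 * the frame size is known to be non-zero.
 */
static bool dpb_buffer_count(uint32_t budget_in_mb, uint32_t width,
			     uint32_t height, uint32_t *count)
{
	uint32_t fs_in_mb;

	assert(count != NULL);
	if (!frame_size_in_mb(width, height, &fs_in_mb))
		return false;

	*count = budget_in_mb / fs_in_mb;
	return true;
}
```

	Validating before the first derived value is computed keeps every later expression free of a zero divisor, which matches the ordering the message describes.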
6 days	drm/amd/display: set MSA MISC1 bit 6 when using VSC SDP for DCE 11.x	Leorize
	When BT.2020 colorimetry is selected, the driver sends information using VSC SDP but does not set "ignore MSA colorimetry" bit on older GPUs with DCE-based IPs. This causes certain sinks to prefer colorimetry information in DP MSA, resulting in terrible color rendering ("dull" colors) when HDR is enabled. This commit wires up the MISC1 bit 6 for GPUs with DCE 11.x based IPs to correctly configure sinks to ignore colorimetry information in MSA, resolving the color rendering issue. Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4849 Assisted-by: oh-my-pi:GPT-5.5 Signed-off-by: Leorize <leorize+oss@disroot.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 323a09e56c1d549ce47d4f110de77b0051b4a8bf) Cc: stable@vger.kernel.org
6 days	drm/amd/pm: fix amdgpu_pm_info power display units	Yang Wang
	amdgpu_pm_info displayed power sensor readings with the wrong fractional unit. It treated the low byte of the raw sensor value as the decimal part of watts, while that field represents milliwatts in the decoded value. As a result, debugfs could report misleading SoC power when the remainder was not already a two-digit centiwatt value.
	Example with query = 0x00000354:
	  raw field       value
	  ---------------------
	  query >> 8      3 W
	  query & 0xff    84 mW
	  decoded power   3084 mW

	  output          value
	  ---------------------
	  before          3.84 W
	  after           3.08 W
	Fixes: f0b8f65b4825 ("drm/amd/amdgpu: fix the GPU power print error in pm info") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 01992b121fb652c753d37e0c1427a2d1a557d2b1) Cc: stable@vger.kernel.org
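	The arithmetic in the example can be checked with a short sketch. Both functions are hypothetical: format_power_legacy() is only a reconstruction of the behavior the message describes (low byte printed directly as the fraction), and format_power() shows one way to obtain the "after" value by decoding to milliwatts first. Neither is the driver's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decode the raw value: upper bits are watts, low byte is milliwatts. */
static uint32_t power_query_to_mw(uint32_t query)
{
	return (query >> 8) * 1000u + (query & 0xffu);
}

/* Reconstruction of the old output: low byte used as the fraction. */
static int format_power_legacy(char *buf, size_t len, uint32_t query)
{
	assert(buf != NULL);
	return snprintf(buf, len, "%u.%02u W",
			(unsigned)(query >> 8), (unsigned)(query & 0xffu));
}

/* Watts with two decimals, truncated from the milliwatt value. */
static int format_power(char *buf, size_t len, uint32_t query)
{
	uint32_t mw = power_query_to_mw(query);

	assert(buf != NULL);
	return snprintf(buf, len, "%u.%02u W",
			(unsigned)(mw / 1000u), (unsigned)(mw % 1000u / 10u));
}
```

	For query = 0x354 the decoded value is 3084 mW, so the legacy form prints "3.84 W" and the milliwatt-based form prints "3.08 W", matching the before and after values in the message.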
6 days	drm/amd/pm: make pp_features read-only when scpm is enabled	Yang Wang
	SCPM owns power feature control when enabled. Make pp_features read-only during sysfs setup by clearing its write bits and store callback. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 6a5786e191fdce36c5db170e5209cf609e8f0087) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/sdma7.1: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c4f230b51cf2d3e7e8b1c800331f3dbed2a9e3f5) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/sdma7.0: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 9723a8bed3aa251a26bee4583bac9d8fb064dd44) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/sdma6.0: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit c17a508a7d652da3728f8bbc481bfffe96d65a87) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/sdma5.2: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ae658afc7f47f6147371ec42cc6b1a793dfdb5af) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/sdma5.0: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 8d144a0eb09537055841af48c9e7c2d4cd48e84d) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/sdma4.4.2: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit fa4f86a148271e325e95287630a3a15a9cd35fdc) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/gfx12.1: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e4d99e04b2e9b13b97d3b17804c735f62689db23) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/gfx12: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit f952076f76d62f783e8ba4995a7c400d39354ccf) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/gfx11: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit daa62107452d2451787c4248ca38fa2d1a0cbefd) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/gfx10: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ac6f00beb658239bced4aaed9efbb04a35348d48) Cc: stable@vger.kernel.org
6 days	drm/amdgpu/gfx9.4.3: replace BUG_ON() with WARN_ON()	Alex Deucher
	There's no need to crash the kernel for these cases. Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 5676593d08998d7a6d9e2d51d6b54b3820e3755c) Cc: stable@vger.kernel.org