summaryrefslogtreecommitdiff
path: root/drivers/gpu
AgeCommit message (Collapse)Author
4 daysMerge tag 'drm-intel-fixes-2026-04-09' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes - Drop check for changed VM in EXECBUF - Fix refcount underflow race in intel_engine_park_heartbeat - Do not use pipe_src as borders for SU area in PSR Signed-off-by: Dave Airlie <airlied@redhat.com> From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patch.msgid.link/add6fPHRC7Bc8Uri@jlahtine-mobl
4 daysMerge tag 'drm-misc-fixes-2026-04-09' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes Several fixes for v3d about memory leak, runtime PM, and locking, and a Kconfig improvement for ethosu. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patch.msgid.link/20260409-omniscient-tomato-coucal-edbadc@penduick
5 daysdrm/i915/gem: Drop check for changed VM in EXECBUFJoonas Lahtinen
Since the introduction of d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle create parameters (v5)") it has not been possible for VM to change after context creation so the check will never fail. Sima's analysis: This check was added in f7ce8639f6ff ("drm/i915/gem: Split the context's obj:vma lut into its own mutex") but without any hint in the commit message as to why. In another hunk of that commit there's a hint though in __eb_add_lut: /* user racing with ctx set-vm */ This would mean that this bug was introduced in e0695db7298e ("drm/i915: Create/destroy VM (ppGTT) for use with contexts"), which allowed to change the gem_ctx->vm at runtime, opening up the race that was partially fixed in the earlier referenced commit about a year later. But it cannot be exploited anymore in anything remotely recent because with the introduction of proto-contexts we've made gem_ctx->vm invariant again, exactly to preemptively close all these potential issues. Specifically d4433c7600f7 ("drm/i915/gem: Use the proto-context to handle create parameters (v5)") is the vm specific part of the proto-context work. v3: - Include Sima's analysis and WARN_ON_ONCE v4: - Focus only on latest mainline codebase References: https://lore.kernel.org/all/20260324151741.29338-1-sosohero200@gmail.com/ Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Andi Shyti <andi.shyti@kernel.org> Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://patch.msgid.link/20260409053111.8914-1-joonas.lahtinen@linux.intel.com (cherry picked from commit f6d4afc9ec6a0bc326151b35a7a3369369180079) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
6 daysdrm/i915/gt: fix refcount underflow in intel_engine_park_heartbeatSebastian Brzezinka
A use-after-free / refcount underflow is possible when the heartbeat worker and intel_engine_park_heartbeat() race to release the same engine->heartbeat.systole request. The heartbeat worker reads engine->heartbeat.systole and calls i915_request_put() on it when the request is complete, but clears the pointer in a separate, non-atomic step. Concurrently, a request retirement on another CPU can drop the engine wakeref to zero, triggering __engine_park() -> intel_engine_park_heartbeat(). If the heartbeat timer is pending at that point, cancel_delayed_work() returns true and intel_engine_park_heartbeat() reads the stale non-NULL systole pointer and calls i915_request_put() on it again, causing a refcount underflow: ``` <4> [487.221889] Workqueue: i915-unordered engine_retire [i915] <4> [487.222640] RIP: 0010:refcount_warn_saturate+0x68/0xb0 ... <4> [487.222707] Call Trace: <4> [487.222711] <TASK> <4> [487.222716] intel_engine_park_heartbeat.part.0+0x6f/0x80 [i915] <4> [487.223115] intel_engine_park_heartbeat+0x25/0x40 [i915] <4> [487.223566] __engine_park+0xb9/0x650 [i915] <4> [487.223973] ____intel_wakeref_put_last+0x2e/0xb0 [i915] <4> [487.224408] __intel_wakeref_put_last+0x72/0x90 [i915] <4> [487.224797] intel_context_exit_engine+0x7c/0x80 [i915] <4> [487.225238] intel_context_exit+0xf1/0x1b0 [i915] <4> [487.225695] i915_request_retire.part.0+0x1b9/0x530 [i915] <4> [487.226178] i915_request_retire+0x1c/0x40 [i915] <4> [487.226625] engine_retire+0x122/0x180 [i915] <4> [487.227037] process_one_work+0x239/0x760 <4> [487.227060] worker_thread+0x200/0x3f0 <4> [487.227068] ? __pfx_worker_thread+0x10/0x10 <4> [487.227075] kthread+0x10d/0x150 <4> [487.227083] ? __pfx_kthread+0x10/0x10 <4> [487.227092] ret_from_fork+0x3d4/0x480 <4> [487.227099] ? __pfx_kthread+0x10/0x10 <4> [487.227107] ret_from_fork_asm+0x1a/0x30 <4> [487.227141] </TASK> ``` Fix this by replacing the non-atomic pointer read + separate clear with xchg() in both racing paths. xchg() is a single indivisible hardware instruction that atomically reads the old pointer and writes NULL. This guarantees only one of the two concurrent callers obtains the non-NULL pointer and performs the put, the other gets NULL and skips it. Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/15880 Fixes: 058179e72e09 ("drm/i915/gt: Replace hangcheck by heartbeats") Cc: <stable@vger.kernel.org> # v5.5+ Signed-off-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com> Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/d4c1c14255688dd07cc8044973c4f032a8d1559e.1775038106.git.sebastian.brzezinka@intel.com (cherry picked from commit 13238dc0ee4f9ab8dafa2cca7295736191ae2f42) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
7 daysdrm/xe: Fix bug in idledly unit conversionVinay Belgaumkar
We only need to convert to picosecond units before writing to RING_IDLEDLY. Fixes: 7c53ff050ba8 ("drm/xe: Apply Wa_16023105232") Cc: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Acked-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patch.msgid.link/20260401012710.4165547-1-vinay.belgaumkar@intel.com (cherry picked from commit 13743bd628bc9d9a0e2fe53488b2891aedf7cc74) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
7 daysdrm/i915/psr: Do not use pipe_src as borders for SU areaJouni Högander
This far using crtc_state->pipe_src as borders for Selective Update area haven't caused visible problems as drm_rect_width(crtc_state->pipe_src) == crtc_state->hw.adjusted_mode.crtc_hdisplay and drm_rect_height(crtc_state->pipe_src) == crtc_state->hw.adjusted_mode.crtc_vdisplay when pipe scaling is not used. On the other hand using pipe scaling is forcing full frame updates and all the Selective Update area calculations are skipped. Now this improper usage of crtc_state->pipe_src is causing following warnings: <4> [7771.978166] xe 0000:00:02.0: [drm] drm_WARN_ON_ONCE(su_lines % vdsc_cfg->slice_height) after WARN_ON_ONCE was added by commit: "drm/i915/dsc: Add helper for writing DSC Selective Update ET parameters" These warnings are seen when DSC and pipe scaling are enabled simultaneously. This is because on full frame update SU area is improperly set as pipe_src which is not aligned with DSC slice height. Fix these by creating local rectangle using crtc_state->hw.adjusted_mode.crtc_hdisplay and crtc_state->hw.adjusted_mode.crtc_vdisplay. Use this local rectangle as borders for SU area. Fixes: d6774b8c3c58 ("drm/i915: Ensure damage clip area is within pipe area") Cc: <stable@vger.kernel.org> # v6.0+ Signed-off-by: Jouni Högander <jouni.hogander@intel.com> Reviewed-by: Mika Kahola <mika.kahola@intel.com> Link: https://patch.msgid.link/20260327114553.195285-1-jouni.hogander@intel.com (cherry picked from commit da0cdc1c329dd2ff09c41fbbe9fbd9c92c5d2c6e) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
9 daysi915: don't use a vma that didn't match the context VMLinus Torvalds
In eb_lookup_vma(), the code checks that the context vm matches before incrementing the i915 vma usage count, but for the non-matching case it didn't clear the non-matching vma pointer, so it would then mistakenly be returned, causing potential UaF and refcount issues. Reported-by: Yassine Mounir <sosohero200@gmail.com> Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11 daysMerge tag 'drm-misc-fixes-2026-04-02' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes A refcounting fix for bridges, revert a previous framebuffer use-after-free fix that turned out to be causing more problems, a hang fix for qaic, an initialization fix for ast, a error handling fix for sysfb, and a speculation fix for drm_compat_ioctl. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patch.msgid.link/20260402-vivid-perfect-caiman-ca055e@houat
11 daysMerge tag 'amd-drm-fixes-7.0-2026-04-02' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-7.0-2026-04-02: amdgpu: - Fix audio regression on renoir Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260402194409.914769-1-alexander.deucher@amd.com
11 daysMerge tag 'drm-xe-fixes-2026-04-02' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes uAPI Fix: - Accept canonical GPU addresses in xe_vm_madvise_ioctl (Arvind) Driver Fixes: - Disallow writes to read-only VMAs (Jonathan) - PXP fixes (Daniele) - Disable garbage collector work item on SVM clos (Brost) - void memory allocations in xe_device_declare_wedged (Brost) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/ac5mDHs-McR5cJSV@intel.com
11 daysMerge tag 'drm-intel-fixes-2026-04-02' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes - Fix for #12045: Huawei Matebook E (DRR-WXX): Persistent Black Screen on Boot with i915 and Gen11: Modesetting and Backlight Control Malfunction - Fix for #15826: i915: Raptor Lake-P [UHD Graphics] display flicker/corruption on eDP panel - Use crtc_state->enhanced_framing properly on ivb/hsw CPU eDP Signed-off-by: Dave Airlie <airlied@redhat.com> From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patch.msgid.link/ac5DM1IpBkuaT58e@jlahtine-mobl
12 daysdrm/amd/display: Wire up dcn10_dio_construct() for all pre-DCN401 generationsIonut Nechita
Description: - Commit b82f0759346617b2 ("drm/amd/display: Migrate DIO registers access from hwseq to dio component") moved DIO_MEM_PWR_CTRL register access behind the new dio abstraction layer but only created the dio object for DCN 4.01. On all other generations (DCN 10/20/21/201/30/301/302/303/ 31/314/315/316/32/321/35/351/36), the dio pointer is NULL, causing the register write to be silently skipped. This results in AFMT HDMI memory not being powered on during init_hw, which can cause HDMI audio failures and display issues on affected hardware including Renoir/Cezanne (DCN 2.1) APUs that use dcn10_init_hw. Call dcn10_dio_construct() in each older DCN generation's resource.c to create the dio object, following the same pattern as DCN 4.01. This ensures the dio pointer is non-NULL and the mem_pwr_ctrl callback works through the dio abstraction for all DCN generations. Fixes: b82f07593466 ("drm/amd/display: Migrate DIO registers access from hwseq to dio component.") Reviewed-by: Ivan Lipski <ivan.lipski@amd.com> Signed-off-by: Ionut Nechita <ionut_n2001@yahoo.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
12 daysdrm/vc4: Protect madv read in vc4_gem_object_mmap() with madv_lockMaíra Canal
The mmap callback reads bo->madv without holding madv_lock, racing with concurrent DRM_IOCTL_VC4_GEM_MADVISE calls that modify the field under the same lock. Add the missing locking to prevent the data race. Fixes: b9f19259b84d ("drm/vc4: Add the DRM_IOCTL_VC4_GEM_MADVISE ioctl") Reviewed-by: Melissa Wen <mwen@igalia.com> Link: https://patch.msgid.link/20260330-vc4-misc-fixes-v1-4-92defc940a29@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com>
12 daysdrm/vc4: Fix a memory leak in hang state error pathMaíra Canal
When vc4_save_hang_state() encounters an early return condition, it returns without freeing the previously allocated `kernel_state`, leaking memory. Add the missing kfree() calls by consolidating the early return paths into a single place. Fixes: 214613656b51 ("drm/vc4: Add an interface for capturing the GPU state after a hang.") Reviewed-by: Melissa Wen <mwen@igalia.com> Link: https://patch.msgid.link/20260330-vc4-misc-fixes-v1-3-92defc940a29@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com>
12 daysdrm/vc4: Fix memory leak of BO array in hang stateMaíra Canal
The hang state's BO array is allocated separately with kzalloc() in vc4_save_hang_state() but never freed in vc4_free_hang_state(). Add the missing kfree() for the BO array before freeing the hang state struct. Fixes: 214613656b51 ("drm/vc4: Add an interface for capturing the GPU state after a hang.") Reviewed-by: Melissa Wen <mwen@igalia.com> Link: https://patch.msgid.link/20260330-vc4-misc-fixes-v1-2-92defc940a29@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com>
12 daysdrm/vc4: Release runtime PM reference after binding V3DMaíra Canal
The vc4_v3d_bind() function acquires a runtime PM reference via pm_runtime_resume_and_get() to access V3D registers during setup. However, this reference is never released after a successful bind. This prevents the device from ever runtime suspending, since the reference count never reaches zero. Release the runtime PM reference by adding pm_runtime_put_autosuspend() after autosuspend is configured, allowing the device to runtime suspend after the delay. Fixes: 266cff37d7fc ("drm/vc4: v3d: Rework the runtime_pm setup") Reviewed-by: Melissa Wen <mwen@igalia.com> Link: https://patch.msgid.link/20260330-vc4-misc-fixes-v1-1-92defc940a29@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com>
12 daysdrm/ioc32: stop speculation on the drm_compat_ioctl pathGreg Kroah-Hartman
The drm compat ioctl path takes a user controlled pointer, and then dereferences it into a table of function pointers, the signature method of spectre problems. Fix this up by calling array_index_nospec() on the index to the function pointer list. Fixes: 505b5240329b ("drm/ioctl: Fix Spectre v1 vulnerabilities") Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: stable <stable@kernel.org> Assisted-by: gkh_clanker_2000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Acked-by: Maxime Ripard <mripard@kernel.org> Reviewed-by: Simona Vetter <simona@ffwll.ch> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patch.msgid.link/2026032451-playing-rummage-8fa2@gregkh
2026-03-31drm/sysfb: Fix efidrm error handling and memory type mismatchChen Ni
Fix incorrect error checking and memory type confusion in efidrm_device_create(). devm_memremap() returns error pointers, not NULL, and returns system memory while devm_ioremap() returns I/O memory. The code incorrectly passes system memory to iosys_map_set_vaddr_iomem(). Restructure to handle each memory type separately. Use devm_ioremap*() with ERR_PTR(-ENXIO) for WC/UC, and devm_memremap() with ERR_CAST() for WT/WB. Fixes: 32ae90c66fb6 ("drm/sysfb: Add efidrm for EFI displays") Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patch.msgid.link/20260311064652.2903449-1-nichen@iscas.ac.cn
2026-03-31drm/i915/dp: Use crtc_state->enhanced_framing properly on ivb/hsw CPU eDPVille Syrjälä
Looks like I missed the drm_dp_enhanced_frame_cap() in the ivb/hsw CPU eDP code when I introduced crtc_state->enhanced_framing. Fix it up so that the state we program to the hardware is guaranteed to match what we computed earlier. Cc: stable@vger.kernel.org Fixes: 3072a24c778a ("drm/i915: Introduce crtc_state->enhanced_framing") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patch.msgid.link/20260325135849.12603-3-ville.syrjala@linux.intel.com Reviewed-by: Michał Grzelak <michal.grzelak@intel.com> (cherry picked from commit 799fe8dc2af52f35c78c4ac97f8e34994dfd8760) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2026-03-31drm/i915/cdclk: Do the full CDCLK dance for min_voltage_level changesVille Syrjälä
Apparently I forgot about the pipe min_voltage_level when I decoupled the CDCLK calculations from modesets. Even if the CDCLK frequency doesn't need changing we may still need to bump the voltage level to accommodate an increase in the port clock frequency. Currently, even if there is a full modeset, we won't notice the need to go through the full CDCLK calculations/programming, unless the set of enabled/active pipes changes, or the pipe/dbuf min CDCLK changes. Duplicate the same logic we use the pipe's min CDCLK frequency to also deal with its min voltage level. Note that the 'allow_voltage_level_decrease' stuff isn't really useful here since the min voltage level can only change during a full modeset. But I think sticking to the same approach in the three similar parts (pipe min cdclk, pipe min voltage level, dbuf min cdclk) is a good idea. Cc: stable@vger.kernel.org Tested-by: Mikhail Rudenko <mike.rudenko@gmail.com> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/15826 Fixes: ba91b9eecb47 ("drm/i915/cdclk: Decouple cdclk from state->modeset") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patch.msgid.link/20260325135849.12603-2-ville.syrjala@linux.intel.com Reviewed-by: Michał Grzelak <michal.grzelak@intel.com> (cherry picked from commit 0f21a14987ebae3c05ad1184ea872e7b7a7b8695) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2026-03-30drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack ↵Donet Tom
size to GPU page size The control stack size is calculated based on the number of CUs and waves, and is then aligned to PAGE_SIZE. When the resulting control stack size is aligned to 64 KB, GPU hangs and queue preemption failures are observed while running RCCL unit tests on systems with more than two GPUs. amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008 amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4 amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008 amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues This issue is observed on both 4 KB and 64 KB system page-size configurations. This patch fixes the issue by aligning the control stack size to AMDGPU_GPU_PAGE_SIZE instead of PAGE_SIZE, so the control stack size will not be 64 KB on systems with a 64 KB page size and queue preemption works correctly. Additionally, In the current code, wg_data_size is aligned to PAGE_SIZE, which can waste memory if the system page size is large. In this patch, wg_data_size is aligned to AMDGPU_GPU_PAGE_SIZE. The cwsr_size, calculated from wg_data_size and the control stack size, is aligned to PAGE_SIZE. Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a3e14436304392fbada359edd0f1d1659850c9b7)
2026-03-30drm/amdgpu: Fix wait after reset sequence in S4Lijo Lazar
For a mode-1 reset done at the end of S4 on PSPv11 dGPUs, only check if TOS is unloaded. Fixes: 32f73741d6ee ("drm/amdgpu: Wait for bootloader after PSPv11 reset") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4853 Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 2fb4883b884a437d760bd7bdf7695a7e5a60bba3) Cc: stable@vger.kernel.org
2026-03-30drm/amd/display: Fix NULL pointer dereference in dcn401_init_hw()Srinivasan Shanmugam
dcn401_init_hw() assumes that update_bw_bounding_box() is valid when entering the update path. However, the existing condition: ((!fams2_enable && update_bw_bounding_box) || freq_changed) does not guarantee this, as the freq_changed branch can evaluate to true independently of the callback pointer. This can result in calling update_bw_bounding_box() when it is NULL. Fix this by separating the update condition from the pointer checks and ensuring the callback, dc->clk_mgr, and bw_params are validated before use. Fixes the below: ../dc/hwss/dcn401/dcn401_hwseq.c:367 dcn401_init_hw() error: we previously assumed 'dc->res_pool->funcs->update_bw_bounding_box' could be null (see line 362) Fixes: ca0fb243c3bb ("drm/amd/display: Underflow Seen on DCN401 eGPU") Cc: Daniel Sa <Daniel.Sa@amd.com> Cc: Alvin Lee <alvin.lee2@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: Alex Hung <alex.hung@amd.com> Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 86117c5ab42f21562fedb0a64bffea3ee5fcd477) Cc: stable@vger.kernel.org
2026-03-30drm/amdgpu: Change AMDGPU_VA_RESERVED_TRAP_SIZE to 64KBDonet Tom
Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with 4K pages, both values match (8KB), so allocation and reserved space are consistent. However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB, while the reserved trap area remains 8KB. This mismatch causes the kernel to crash when running rocminfo or rccl unit tests. Kernel attempted to read user page (2) - exploit attempt? (uid: 1001) BUG: Kernel NULL pointer dereference on read at 0x00000002 Faulting instruction address: 0xc0000000002c8a64 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY Tainted: [E]=UNSIGNED_MODULE Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries NIP: c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730 REGS: c0000001e0957580 TRAP: 0300 Tainted: G E MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268 XER: 00000036 CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000 IRQMASK: 1 GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100 c00000013d814540 GPR04: 0000000000000002 c00000013d814550 0000000000000045 0000000000000000 GPR08: c00000013444d000 c00000013d814538 c00000013d814538 0000000084002268 GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff 0000000000020000 GPR16: 0000000000000000 0000000000000002 c00000015f653000 0000000000000000 GPR20: c000000138662400 c00000013d814540 0000000000000000 c00000013d814500 GPR24: 0000000000000000 0000000000000002 c0000001e0957888 c0000001e0957878 GPR28: c00000013d814548 0000000000000000 c00000013d814540 c0000001e0957888 NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0 LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00 Call Trace: 0xc0000001e0957890 (unreliable) __mutex_lock.constprop.0+0x58/0xd00 amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu] kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu] kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu] kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu] kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu] kfd_ioctl+0x514/0x670 [amdgpu] sys_ioctl+0x134/0x180 system_call_exception+0x114/0x300 system_call_vectored_common+0x15c/0x2ec This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 64 KB and KFD_CWSR_TBA_TMA_SIZE to the AMD GPU page size. This means we reserve 64 KB for the trap in the address space, but only allocate 8 KB within it. With this approach, the allocation size never exceeds the reserved area. Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole") Reviewed-by: Christian König <christian.koenig@amd.com> Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 31b8de5e55666f26ea7ece5f412b83eab3f56dbb) Cc: stable@vger.kernel.org
2026-03-30drm/amdgpu/userq: fix memory leak in MQD creation error pathsJunrui Luo
In mes_userq_mqd_create(), the memdup_user() allocations for IP-specific MQD structs are not freed when subsequent VA validation fails. The goto free_mqd label only cleans up the MQD BO object and userq_props. Fix by adding kfree() before each goto free_mqd on VA validation failure in the COMPUTE, GFX, and SDMA branches. Fixes: 9e46b8bb0539 ("drm/amdgpu: validate userq buffer virtual address and size") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 27f5ff9e4a4150d7cf8b4085aedd3b77ddcc5d08) Cc: stable@vger.kernel.org
2026-03-30drm/amd: Fix MQD and control stack alignment for non-4KDonet Tom
For gfxV9, due to a hardware bug ("based on the comments in the code here [1]"), the control stack of a user-mode compute queue must be allocated immediately after the page boundary of its regular MQD buffer. To handle this, we allocate an enlarged MQD buffer where the first page is used as the MQD and the remaining pages store the control stack. Although these regions share the same BO, they require different memory types: the MQD must be UC (uncached), while the control stack must be NC (non-coherent), matching the behavior when the control stack is allocated in user space. This logic works correctly on systems where the CPU page size matches the GPU page size (4K). However, the current implementation aligns both the MQD and the control stack to the CPU PAGE_SIZE. On systems with a larger CPU page size, the entire first CPU page is marked UC—even though that page may contain multiple GPU pages. The GPU treats the second 4K GPU page inside that CPU page as part of the control stack, but it is incorrectly mapped as UC. This patch fixes the issue by aligning both the MQD and control stack sizes to the GPU page size (4K). The first 4K page is correctly marked as UC for the MQD, and the remaining GPU pages are marked NC for the control stack. This ensures proper memory type assignment on systems with larger CPU page sizes. [1]: https://elixir.bootlin.com/linux/v6.18/source/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c#L118 Acked-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 998d6781410de1c4b787fdbf6c56e851ea7fa553)
2026-03-30drm/amdkfd: Align expected_queue_size to PAGE_SIZEDonet Tom
The AQL queue size can be 4K, but the minimum buffer object (BO) allocation size is PAGE_SIZE. On systems with a page size larger than 4K, the expected queue size does not match the allocated BO size, causing queue creation to fail. Align the expected queue size to PAGE_SIZE so that it matches the allocated BO size and allows queue creation to succeed. Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit b01cd158a2f5230b137396c5f8cda3fc780abbc2)
2026-03-30drm/amdgpu: fix the idr allocation flagsPrike Liang
Fix the IDR allocation flags by using atomic GFP flags in non‑sleepable contexts to avoid the __might_sleep() complaint. 268.290239] [drm] Initialized amdgpu 3.64.0 for 0000:03:00.0 on minor 0 [ 268.294900] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:323 [ 268.295355] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1744, name: modprobe [ 268.295705] preempt_count: 1, expected: 0 [ 268.295886] RCU nest depth: 0, expected: 0 [ 268.296072] 2 locks held by modprobe/1744: [ 268.296077] #0: ffff8c3a44abd1b8 (&dev->mutex){....}-{4:4}, at: __driver_attach+0xe4/0x210 [ 268.296100] #1: ffffffffc1a6ea78 (amdgpu_pasid_idr_lock){+.+.}-{3:3}, at: amdgpu_pasid_alloc+0x26/0xe0 [amdgpu] [ 268.296494] CPU: 12 UID: 0 PID: 1744 Comm: modprobe Tainted: G U OE 6.19.0-custom #16 PREEMPT(voluntary) [ 268.296498] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 268.296499] Hardware name: AMD Majolica-RN/Majolica-RN, BIOS RMJ1009A 06/13/2021 [ 268.296501] Call Trace: Fixes: 8f1de51f49be ("drm/amdgpu: prevent immediate PASID reuse case") Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ea56aa2625708eaf96f310032391ff37746310ef) Cc: stable@vger.kernel.org
2026-03-30drm/amdgpu: validate doorbell_offset in user queue creationJunrui Luo
amdgpu_userq_get_doorbell_index() passes the user-provided doorbell_offset to amdgpu_doorbell_index_on_bar() without bounds checking. An arbitrarily large doorbell_offset can cause the calculated doorbell index to fall outside the allocated doorbell BO, potentially corrupting kernel doorbell space. Validate that doorbell_offset falls within the doorbell BO before computing the BAR index, using u64 arithmetic to prevent overflow. Fixes: f09c1e6077ab ("drm/amdgpu: generate doorbell index for userqueue") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit de1ef4ffd70e1d15f0bf584fd22b1f28cbd5e2ec) Cc: stable@vger.kernel.org
2026-03-30drm/amdgpu/pm: drop SMU driver if version not matched messagesAlex Deucher
It just leads to user confusion. Cc: Yang Wang <kevinyang.wang@amd.com> Cc: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit e471627d56272a791972f25e467348b611c31713) Cc: stable@vger.kernel.org
2026-03-30drm/xe: Avoid memory allocations in xe_device_declare_wedged()Matthew Brost
xe_device_declare_wedged() runs in the DMA-fence signaling path, where GFP_KERNEL memory allocations are not allowed. However, registering xe_device_wedged_fini via drmm_add_action_or_reset() triggers a GFP_KERNEL allocation. Fix this by deferring the registration of xe_device_wedged_fini until late in the driver load sequence. Additionally, drop the wedged PM reference only if the device is actually wedged in xe_device_wedged_fini. Fixes: 452bca0edbd0 ("drm/xe: Don't suspend device upon wedge") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/20260326210116.202585-2-matthew.brost@intel.com (cherry picked from commit b08ceb443866808b881b12d4183008d214d816c1) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe: Disable garbage collector work item on SVM closeMatthew Brost
When an SVM is closed, the garbage collector work item must be stopped synchronously and any future queuing must be prevented. Replace flush_work() with disable_work_sync() to ensure both conditions are met. Fixes: 63f6e480d115 ("drm/xe: Add SVM garbage collector") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260227015225.3081787-1-matthew.brost@intel.com (cherry picked from commit 2247feb9badca5a4774df9a437bfc44fba4f22de) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/pxp: Don't allow PXP on older PTL GSC FWsDaniele Ceraolo Spurio
On PTL, older GSC FWs have a bug that can cause them to crash during PXP invalidation events, which leads to a complete loss of power management on the media GT. Therefore, we can't use PXP on FWs that have this bug, which was fixed in PTL GSC build 1396. Fixes: b1dcec9bd8a1 ("drm/xe/ptl: Enable PXP for PTL") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/20260324153718.3155504-10-daniele.ceraolospurio@intel.com (cherry picked from commit 6eb04caaa972934c9b6cea0e0c29e466bf9a346f) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/pxp: Clear restart flag in pxp_start after jumping backDaniele Ceraolo Spurio
If we don't clear the flag we'll keep jumping back at the beginning of the function once we reach the end. Fixes: ccd3c6820a90 ("drm/xe/pxp: Decouple queue addition from PXP start") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patch.msgid.link/20260324153718.3155504-9-daniele.ceraolospurio@intel.com (cherry picked from commit 0850ec7bb2459602351639dccf7a68a03c9d1ee0) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/pxp: Remove incorrect handling of impossible state during suspendDaniele Ceraolo Spurio
The default case of the PXP suspend switch is incorrectly exiting without releasing the lock. However, this case is impossible to hit because we're switching on an enum and all the valid enum values have their own cases. Therefore, we can just get rid of the default case and rely on the compiler to warn us if a new enum value is added and we forget to add it to the switch. Fixes: 51462211f4a9 ("drm/xe/pxp: add PXP PM support") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patch.msgid.link/20260324153718.3155504-8-daniele.ceraolospurio@intel.com (cherry picked from commit f1b5a77fc9b6a90cd9a5e3db9d4c73ae1edfcfac) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/pxp: Clean up termination status on failureDaniele Ceraolo Spurio
If the PXP HW termination fails during PXP start, the normal completion code won't be called, so the termination will remain uncomplete. To avoid unnecessary waits, mark the termination as completed from the error path. Note that we already do this if the termination fails when handling a termination irq from the HW. Fixes: f8caa80154c4 ("drm/xe/pxp: Add PXP queue tracking and session start") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Julia Filipchuk <julia.filipchuk@intel.com> Link: https://patch.msgid.link/20260324153718.3155504-7-daniele.ceraolospurio@intel.com (cherry picked from commit 5d9e708d2a69ab1f64a17aec810cd7c70c5b9fab) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/madvise: Accept canonical GPU addresses in xe_vm_madvise_ioctlArvind Yadav
Userspace passes canonical (sign-extended) GPU addresses where bits 63:48 mirror bit 47. The internal GPUVM uses non-canonical form (upper bits zeroed), so passing raw canonical addresses into GPUVM lookups causes mismatches for addresses above 128TiB. Strip the sign extension with xe_device_uncanonicalize_addr() at the top of xe_vm_madvise_ioctl(). Non-canonical addresses are unaffected. Fixes: ada7486c5668 ("drm/xe: Implement madvise ioctl for xe") Suggested-by: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Signed-off-by: Arvind Yadav <arvind.yadav@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260326130843.3545241-13-arvind.yadav@intel.com (cherry picked from commit 05c8b1cdc54036465ea457a0501a8c2f9409fce7) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/xe/xe_pagefault: Disallow writes to read-only VMAsJonathan Cavitt
The page fault handler should reject write/atomic access to read only VMAs. Add code to handle this in xe_pagefault_service after the VMA lookup. v2: - Apply max line length (Matthew) Fixes: fb544b844508 ("drm/xe: Implement xe_pagefault_queue_work") Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260324152935.72444-7-jonathan.cavitt@intel.com (cherry picked from commit 714ee6754ac5fa3dc078856a196a6b124cd797a0) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-30drm/ast: dp501: Fix initialization of SCU2CThomas Zimmermann
Ast's DP501 initialization reads the register SCU2C at offset 0x1202c and tries to set it to source data from VGA. But writes the update to offset 0x0, with unknown results. Write the result to SCU instead. The bug only happens in ast_init_analog(). There's similar code in ast_init_dvo(), which works correctly. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Fixes: 83c6620bae3f ("drm/ast: initial DP501 support (v0.2)") Reviewed-by: Jocelyn Falempe <jfalempe@redhat.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: dri-devel@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v3.16+ Link: https://patch.msgid.link/20260327133532.79696-2-tzimmermann@suse.de
2026-03-30drm/i915/dsi: Don't do DSC horizontal timing adjustments in command modeVille Syrjälä
Stop adjusting the horizontal timing values based on the compression ratio in command mode. Bspec seems to be telling us to do this only in video mode, and this is also how the Windows driver does things. This should also fix a div-by-zero on some machines because the adjusted htotal ends up being so small that we end up with line_time_us==0 when trying to determine the vtotal value in command mode. Note that this doesn't actually make the display on the Huawei Matebook E work, but at least the kernel no longer explodes when the driver loads. Cc: stable@vger.kernel.org Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12045 Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patch.msgid.link/20260326111814.9800-2-ville.syrjala@linux.intel.com Fixes: 53693f02d80e ("drm/i915/dsi: account for DSC in horizontal timings") Reviewed-by: Jani Nikula <jani.nikula@intel.com> (cherry picked from commit 0b475e91ecc2313207196c6d7fd5c53e1a878525) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
2026-03-28Merge tag 'mediatek-drm-fixes-20260323' of ↵Dave Airlie
https://git.kernel.org/pub/scm/linux/kernel/git/chunkuang.hu/linux into drm-fixes Mediatek DRM Fixes - 20260323 1. dsi: Store driver data before invoking mipi_dsi_host_register Signed-off-by: Dave Airlie <airlied@redhat.com> From: Chun-Kuang Hu <chunkuang.hu@kernel.org> Link: https://patch.msgid.link/20260323160135.39609-1-chunkuang.hu@kernel.org
2026-03-27Merge tag 'drm-xe-fixes-2026-03-26' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes - Fix UAF in SRIOV migration restore (Winiarski) - Updates to HW W/a (Roper) - VMBind remap fix (Auld) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/acUgq2q2DrCUzFql@intel.com
2026-03-27Merge tag 'drm-misc-fixes-2026-03-26' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes A page mapping fix for shmem fault handler, a power-off fix for ivpu, a GFP_* flag fix for syncobj, and a MAINTAINERS update. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <mripard@redhat.com> Link: https://patch.msgid.link/20260326-lush-cuddly-limpet-ab2aa9@houat
2026-03-27Merge tag 'drm-intel-fixes-2026-03-26' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes - DP tunnel error handling fix - Spurious GMBUS timeout fix - Unlink NV12 planes earlier - Order OP vs. timeout correctly in __wait_for() Signed-off-by: Dave Airlie <airlied@redhat.com> From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patch.msgid.link/acTdjAoOGkzl3dcc@jlahtine-mobl
2026-03-26Revert "drm: Fix use-after-free on framebuffers and property blobs when ↵Maarten Lankhorst
calling drm_dev_unplug" This reverts commit 6bee098b91417654703e17eb5c1822c6dfd0c01d. Den 2026-03-25 kl. 22:11, skrev Simona Vetter: > On Wed, Mar 25, 2026 at 10:26:40AM -0700, Guenter Roeck wrote: >> Hi, >> >> On Fri, Mar 13, 2026 at 04:17:27PM +0100, Maarten Lankhorst wrote: >>> When trying to do a rather aggressive test of igt's "xe_module_load >>> --r reload" with a full desktop environment and game running I noticed >>> a few OOPSes when dereferencing freed pointers, related to >>> framebuffers and property blobs after the compositor exits. >>> >>> Solve this by guarding the freeing in drm_file with drm_dev_enter/exit, >>> and immediately put the references from struct drm_file objects during >>> drm_dev_unplug(). >>> >> >> With this patch in v6.18.20, I get the warning backtraces below. >> The backtraces are gone with the patch reverted. > > Yeah, this needs to be reverted, reasoning below. Maarten, can you please > take care of that and feed the revert through the usual channels? I don't > think it's critical enough that we need to fast-track this into drm.git > directly. > > Quoting the patch here again: > >> drivers/gpu/drm/drm_file.c | 5 ++++- >> drivers/gpu/drm/drm_mode_config.c | 9 ++++++--- >> 2 files changed, 10 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c >> index ec820686b3021..f52141f842a1f 100644 >> --- a/drivers/gpu/drm/drm_file.c >> +++ b/drivers/gpu/drm/drm_file.c >> @@ -233,6 +233,7 @@ static void drm_events_release(struct drm_file *file_priv) >> void drm_file_free(struct drm_file *file) >> { >> struct drm_device *dev; >> + int idx; >> >> if (!file) >> return; >> @@ -249,9 +250,11 @@ void drm_file_free(struct drm_file *file) >> >> drm_events_release(file); >> >> - if (drm_core_check_feature(dev, DRIVER_MODESET)) { >> + if (drm_core_check_feature(dev, DRIVER_MODESET) && >> + drm_dev_enter(dev, &idx)) { > > This is misplaced for two reasons: > > - Even if we'd want to guarantee that we hold a drm_dev_enter/exit > reference during framebuffer teardown, we'd need to do this > _consistently over all callsites. Not ad-hoc in just one place that a > testcase hits. This also means kerneldoc updates of the relevant hooks > and at least a bunch of acks from other driver people to document the > consensus. > > - More importantly, this is driver responsibilities in general unless we > have extremely good reasons to the contrary. Which means this must be > placed in xe. > >> drm_fb_release(file); >> drm_property_destroy_user_blobs(dev, file); >> + drm_dev_exit(idx); >> } >> >> if (drm_core_check_feature(dev, DRIVER_SYNCOBJ)) >> diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c >> index 84ae8a23a3678..e349418978f79 100644 >> --- a/drivers/gpu/drm/drm_mode_config.c >> +++ b/drivers/gpu/drm/drm_mode_config.c >> @@ -583,10 +583,13 @@ void drm_mode_config_cleanup(struct drm_device *dev) >> */ >> WARN_ON(!list_empty(&dev->mode_config.fb_list)); >> list_for_each_entry_safe(fb, fbt, &dev->mode_config.fb_list, head) { >> - struct drm_printer p = drm_dbg_printer(dev, DRM_UT_KMS, "[leaked fb]"); >> + if (list_empty(&fb->filp_head) || drm_framebuffer_read_refcount(fb) > 1) { >> + struct drm_printer p = drm_dbg_printer(dev, DRM_UT_KMS, "[leaked fb]"); > > This is also wrong: > > - Firstly, it's a completely independent bug, we do not smash two bugfixes > into one patch. > > - Secondly, it's again a driver bug: drm_mode_cleanup must be called when > the last drm_device reference disappears (hence the existence of > drmm_mode_config_init), not when the driver gets unbound. The fact that > this shows up in a callchain from a devres cleanup means the intel > driver gets this wrong (like almost everyone else because historically > we didn't know better). > > If we don't follow this rule, then we get races with this code here > running concurrently with drm_file fb cleanups, which just does not > work. Review pointed that out, but then shrugged it off with a confused > explanation: > > https://lore.kernel.org/all/e61e64c796ccfb17ae673331a3df4b877bf42d82.camel@linux.intel.com/ > > Yes this also means a lot of the other drm_device teardown that drivers > do happens way too early. There is a massive can of worms here of a > magnitude that most likely is much, much bigger than what you can > backport to stable kernels. Hotunplug is _hard_. Back to the drawing board, and fixing it in the intel display driver instead. Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: 6bee098b9141 ("drm: Fix use-after-free on framebuffers and property blobs when calling drm_dev_unplug") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Link: https://patch.msgid.link/20260326082217.39941-2-dev@lankhorst.se
2026-03-26drm/bridge: Fix refcount shown via debugfs for encoder_bridges_show()Liu Ying
A typical bridge refcount value is 3 after a bridge chain is formed: - devm_drm_bridge_alloc() initializes the refcount value to be 1. - drm_bridge_add() gets an additional reference hence 2. - drm_bridge_attach() gets the third reference hence 3. This typical refcount value aligns with allbridges_show()'s behaviour. However, since encoder_bridges_show() uses drm_for_each_bridge_in_chain_scoped() to automatically get/put the bridge reference while iterating, a bogus reference is accidentally got when showing the wrong typical refcount value as 4 to users via debugfs. Fix this by caching the refcount value returned from kref_read() while iterating and explicitly decreasing the cached refcount value by 1 before showing it to users. Fixes: bd57048e4576 ("drm/bridge: use drm_for_each_bridge_in_chain_scoped()") Signed-off-by: Liu Ying <victor.liu@nxp.com> Reviewed-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Tested-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Link: https://patch.msgid.link/20260318-drm-misc-next-2026-03-05-fix-encoder-bridges-refcount-v3-1-147fea581279@nxp.com Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
2026-03-25drm/xe: always keep track of remap prev/nextMatthew Auld
During 3D workload, user is reporting hitting: [ 413.361679] WARNING: drivers/gpu/drm/xe/xe_vm.c:1217 at vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe], CPU#7: vkd3d_queue/9925 [ 413.361944] CPU: 7 UID: 1000 PID: 9925 Comm: vkd3d_queue Kdump: loaded Not tainted 7.0.0-070000rc3-generic #202603090038 PREEMPT(lazy) [ 413.361949] RIP: 0010:vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe] [ 413.362074] RSP: 0018:ffffd4c25c3df930 EFLAGS: 00010282 [ 413.362077] RAX: 0000000000000000 RBX: ffff8f3ee817ed10 RCX: 0000000000000000 [ 413.362078] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 413.362079] RBP: ffffd4c25c3df980 R08: 0000000000000000 R09: 0000000000000000 [ 413.362081] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f41fbf99380 [ 413.362082] R13: ffff8f3ee817e968 R14: 00000000ffffffef R15: ffff8f43d00bd380 [ 413.362083] FS: 00000001040ff6c0(0000) GS:ffff8f4696d89000(0000) knlGS:00000000330b0000 [ 413.362085] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 [ 413.362086] CR2: 00007ddfc4747000 CR3: 00000002e6262005 CR4: 0000000000f72ef0 [ 413.362088] PKRU: 55555554 [ 413.362089] Call Trace: [ 413.362092] <TASK> [ 413.362096] xe_vm_bind_ioctl+0xa9a/0xc60 [xe] Which seems to hint that the vma we are re-inserting for the ops unwind is either invalid or overlapping with something already inserted in the vm. It shouldn't be invalid since this is a re-insertion, so must have worked before. Leaving the likely culprit as something already placed where we want to insert the vma. Following from that, for the case where we do something like a rebind in the middle of a vma, and one or both mapped ends are already compatible, we skip doing the rebind of those vma and set next/prev to NULL. As well as then adjust the original unmap va range, to avoid unmapping the ends. However, if we trigger the unwind path, we end up with three va, with the two ends never being removed and the original va range in the middle still being the shrunken size. If this occurs, one failure mode is when another unwind op needs to interact with that range, which can happen with a vector of binds. For example, if we need to re-insert something in place of the original va. In this case the va is still the shrunken version, so when removing it and then doing a re-insert it can overlap with the ends, which were never removed, triggering a warning like above, plus leaving the vm in a bad state. With that, we need two things here: 1) Stop nuking the prev/next tracking for the skip cases. Instead relying on checking for skip prev/next, where needed. That way on the unwind path, we now correctly remove both ends. 2) Undo the unmap va shrinkage, on the unwind path. With the two ends now removed the unmap va should expand back to the original size again, before re-insertion. v2: - Update the explanation in the commit message, based on an actual IGT of triggering this issue, rather than conjecture. - Also undo the unmap shrinkage, for the skip case. With the two ends now removed, the original unmap va range should expand back to the original range. v3: - Track the old start/range separately. vma_size/start() uses the va info directly. Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7602 Fixes: 8f33b4f054fc ("drm/xe: Avoid doing rebinds") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260318100208.78097-2-matthew.auld@intel.com (cherry picked from commit aec6969f75afbf4e01fd5fb5850ed3e9c27043ac) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2026-03-25drm/syncobj: Fix xa_alloc allocation flagsTvrtko Ursulin
The xarray conversion blindly and wrongly replaced idr_alloc with xa_alloc and kept the GFP_NOWAIT. It should have been GFP_KERNEL to account for idr_preload it removed. Fix it. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: fec2c3c01f1c ("drm/syncobj: Convert syncobj idr to xarray") Reported-by: Himanshu Girotra <himanshu.girotra@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Himanshu Girotra <himanshu.girotra@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net> Link: https://lore.kernel.org/r/20260324111019.22467-1-tvrtko.ursulin@igalia.com
2026-03-24drm/amd/display: Fix DCE LVDS handlingAlex Deucher
LVDS does not use an HPD pin so it may be invalid. Handle this case correctly in link encoder creation. Fixes: 7c8fb3b8e9ba ("drm/amd/display: Add hpd_source index check for DCE60/80/100/110/112/120 link encoders") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/5012 Cc: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Cc: Roman Li <roman.li@amd.com> Reviewed-by: Roman Li <roman.li@amd.com> Reviewed-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 3b5620f7ee688177fcf65cf61588c5435bce1872) Cc: stable@vger.kernel.org
2026-03-24drm/amdgpu: Handle GPU page faults correctly on non-4K page systemsDonet Tom
During a GPU page fault, the driver restores the SVM range and then maps it into the GPU page tables. The current implementation passes a GPU-page-size (4K-based) PFN to svm_range_restore_pages() to restore the range. SVM ranges are tracked using system-page-size PFNs. On systems where the system page size is larger than 4K, using GPU-page-size PFNs to restore the range causes two problems: Range lookup fails: Because the restore function receives PFNs in GPU (4K) units, the SVM range lookup does not find the existing range. This will result in a duplicate SVM range being created. VMA lookup failure: The restore function also tries to locate the VMA for the faulting address. It converts the GPU-page-size PFN into an address using the system page size, which results in an incorrect address on non-4K page-size systems. As a result, the VMA lookup fails with the message: "address 0xxxx VMA is removed". This patch passes the system-page-size PFN to svm_range_restore_pages() so that the SVM range is restored correctly on non-4K page systems. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 074fe395fb13247b057f60004c7ebcca9f38ef46)