linux.git/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c, branch v7.2-rc1

drm/amdgpu: initialize irq.lock spinlock earlier

2026-06-17T22:19:47+00:00

If there is an early failure during amdgpu probe, like missing firmware, it
will end up calling amdgpu_irq_disable_all, which takes irq.lock spinlock
without it being initialized.

Initializing irq.lock earlier at amdgpu_device_init fixes the issue.
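
The ordering problem can be illustrated with a small userspace sketch. This
is not the driver code: toy_device, toy_device_init() and
toy_irq_disable_all() are hypothetical names, and a plain flag stands in for
what spin_lock_init() sets up. It only shows why the lock has to be
initialized before any step that can fail and fall into the teardown path.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical stand-in for a spinlock; 'initialized' models the state
 * that spin_lock_init() establishes and that lockdep checks for. */
struct toy_lock {
	bool initialized;
	bool held;
};

struct toy_device {
	struct toy_lock irq_lock;
	int uninit_lock_uses;	/* uses of a never-initialized lock */
};

static void toy_lock_init(struct toy_lock *l)
{
	l->initialized = true;
	l->held = false;
}

/* Teardown helper: takes the lock unconditionally, which is what makes
 * a late lock initialization a problem on early failures. */
static void toy_irq_disable_all(struct toy_device *dev)
{
	if (!dev->irq_lock.initialized)
		dev->uninit_lock_uses++;
	dev->irq_lock.held = true;
	/* ... disable interrupt sources ... */
	dev->irq_lock.held = false;
}

/* early_lock_init selects the fixed ordering; fw_present == false
 * models an early probe failure such as missing firmware. */
static int toy_device_init(struct toy_device *dev, bool early_lock_init,
			   bool fw_present)
{
	*dev = (struct toy_device){0};

	if (early_lock_init)
		toy_lock_init(&dev->irq_lock);

	if (!fw_present) {
		toy_irq_disable_all(dev);	/* error path runs teardown */
		return -ENOENT;
	}

	if (!early_lock_init)
		toy_lock_init(&dev->irq_lock);	/* old, late placement */
	return 0;
}
```

With the late placement, a failed init leaves uninit_lock_uses at 1; with the
early placement the same failure leaves it at 0.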

[   79.334079] INFO: trying to register non-static key.
[   79.334081] The code is fine but needs lockdep annotation, or maybe
[   79.334083] you didn't initialize this object before use?
[   79.334084] turning off the locking correctness validator.
[   79.334088] CPU: 2 UID: 0 PID: 1819 Comm: bash Not tainted 7.1.0-rc5-gfd06300b2348 #96 PREEMPT  8e8f461221633dae3c832d6689eaf0546c0ed4cd
[   79.334092] Hardware name: Valve Jupiter/Jupiter, BIOS F7A0133 08/05/2024
[   79.334094] Call Trace:
[   79.334095]  <TASK>
[   79.334097]  dump_stack_lvl+0x5d/0x80
[   79.334103]  register_lock_class+0x7af/0x7c0
[   79.334109]  __lock_acquire+0x416/0x2610
[   79.334114]  lock_acquire+0xcf/0x310
[   79.334117]  ? amdgpu_irq_disable_all+0x3b/0xf0 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.334503]  ? _raw_spin_lock_irqsave+0x53/0x60
[   79.334508]  _raw_spin_lock_irqsave+0x3f/0x60
[   79.334510]  ? amdgpu_irq_disable_all+0x3b/0xf0 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.334881]  amdgpu_irq_disable_all+0x3b/0xf0 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.335240]  amdgpu_device_fini_hw+0x90/0x32c [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.335704]  amdgpu_driver_load_kms.cold+0x22/0x44 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.336159]  amdgpu_pci_probe+0x204/0x440 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.336494]  local_pci_probe+0x3c/0x80
[   79.336500]  pci_call_probe+0x55/0x2e0
[   79.336505]  ? _raw_spin_unlock+0x2d/0x50
[   79.336508]  ? pci_match_device+0x157/0x180
[   79.336512]  pci_device_probe+0x9b/0x170
[   79.336516]  really_probe+0xd5/0x370
[   79.336521]  __driver_probe_device+0x84/0x150
[   79.336525]  device_driver_attach+0x47/0xb0
[   79.336528]  bind_store+0x73/0xc0
[   79.336531]  kernfs_fop_write_iter+0x176/0x250
[   79.336536]  vfs_write+0x24d/0x560
[   79.336542]  ksys_write+0x71/0xe0
[   79.336546]  do_syscall_64+0x122/0x710
[   79.336550]  ? do_syscall_64+0xd1/0x710
[   79.336553]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[   79.336557] RIP: 0033:0x7f92fd675006
[   79.336561] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
[   79.336562] RSP: 002b:00007ffe4fa867a0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[   79.336565] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f92fd675006
[   79.336567] RDX: 000000000000000d RSI: 000055b2dfce59b0 RDI: 0000000000000001
[   79.336568] RBP: 00007ffe4fa867c0 R08: 0000000000000000 R09: 0000000000000000
[   79.336569] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000d
[   79.336570] R13: 000055b2dfce59b0 R14: 00007f92fd7ca5c0 R15: 000055b2dfdbaf70
[   79.336574]  </TASK>

Fixes: 9950cda2a018 ("drm/amdgpu: drop the drm irq pre/post/un install callbacks")
Reviewed-by: Tvrtko Ursulin 
Signed-off-by: Thadeu Lima de Souza Cascardo 
Signed-off-by: Alex Deucher 
(cherry picked from commit 7dba3e10ecdeec85208e255853fcd3890880b10e)

drm/amdgpu: skip already suspended IP blocks in ip_suspend_phase2

2026-06-17T22:08:54+00:00

The GPU reload test (S3 / mode1 reset / module reload) triggers a
WARN_ON in amdgpu_irq_put() on gfx10 when unloading amdgpu:

  WARNING: CPU: 0 PID: 2314 at amd/amdgpu/amdgpu_irq.c:676 amdgpu_irq_put+0xc3/0xe0 [amdgpu]
  Call Trace:
   gfx_v10_0_hw_fini+0x41/0x150 [amdgpu]
   amdgpu_ip_block_hw_fini+0x29/0xc0 [amdgpu]
   amdgpu_device_fini_hw+0x315/0x610 [amdgpu]
   amdgpu_driver_unload_kms+0x7c/0x90 [amdgpu]
   amdgpu_pci_remove+0x51/0x90 [amdgpu]

amdgpu_device_ip_resume_phase2() skips IP blocks whose status.hw is
already set, but amdgpu_device_ip_suspend_phase2() never had the
matching guard, so a block can be suspended twice (e.g. a reset or
recovery issued while the device is already suspended).  The second
suspend runs hw_fini again, which now releases the gfx fault IRQs
unconditionally, dropping a refcount that is already zero and tripping
the WARN_ON in amdgpu_irq_put().

The fault/EOP IRQ get/put were balanced through late_init/hw_fini
before, which masked the double-suspend; moving the get into hw_init
made the suspend/resume asymmetry visible as an IRQ refcount underflow.

Honor status.hw in ip_suspend_phase2() so suspend mirrors resume and a
block is only torn down once.
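
A simplified model of the asymmetry, as a sketch only: toy_ip_block and the
toy_* helpers are hypothetical, 'hw' plays the role of status.hw, and
'irq_refs' stands in for the IRQ references taken in hw_init. The
honor_hw switch in toy_suspend() corresponds to the guard this change adds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified IP block. */
struct toy_ip_block {
	bool hw;		/* block is currently initialized in hardware */
	int irq_refs;		/* IRQ references held by the block */
	int underflows;		/* puts that hit a zero refcount (the WARN_ON case) */
};

static void toy_hw_init(struct toy_ip_block *b)
{
	b->irq_refs++;
	b->hw = true;
}

static void toy_hw_fini(struct toy_ip_block *b)
{
	if (b->irq_refs == 0)
		b->underflows++;	/* releasing a reference that is not held */
	else
		b->irq_refs--;
	b->hw = false;
}

/* Suspend every block in reverse order. With honor_hw set, blocks that
 * are already down are skipped, mirroring the check resume performs. */
static void toy_suspend(struct toy_ip_block *blocks, size_t n, bool honor_hw)
{
	for (size_t i = n; i-- > 0; ) {
		if (honor_hw && !blocks[i].hw)
			continue;
		toy_hw_fini(&blocks[i]);
	}
}

/* Resume skips blocks whose hw flag is already set. */
static void toy_resume(struct toy_ip_block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (blocks[i].hw)
			continue;
		toy_hw_init(&blocks[i]);
	}
}
```

Calling toy_suspend() twice in a row without the guard records one underflow
per block; with the guard the second call is a no-op.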

Fixes: 9117d8be850b ("drm/amdgpu/gfx: move fault and EOP IRQ get/put to hw_init/hw_fini")
Fixes: 482f0e538580 ("drm/amdgpu: fix double ucode load by PSP(v3)")
Signed-off-by: Yunxiang Li 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 
(cherry picked from commit f44f2af13c418969be358b15743f939d705de998)

drm/amdgpu: deprecate guilty handling

2026-06-04T19:24:29+00:00

The guilty handling tried to establish a second way of signaling problems with
the GPU back to userspace. This caused quite a few issues we had to work
around, especially lifetime issues with the drm_sched_entity.

Just drop the handling altogether and use the dma_fence based approach instead.

v2: fix reversed condition in entity check (Alex)

Reviewed-by: Alex Deucher 
Signed-off-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdgpu: Add lockdep annotations for lock ordering validation

2026-06-04T19:24:22+00:00

Add lockdep annotations to teach lockdep the correct lock hierarchy
and catch ordering violations during development. This follows the
pattern established by dma-resv in drivers/dma-buf/dma-resv.c.

Lock ordering hierarchy (outermost to innermost):

1. userq_sch_mutex   - Global userq scheduler (enforce_isolation)
2. userq_mutex       - Per-context userq (held across queue create/destroy)
3. notifier_lock     - MMU notifier synchronization
4. vram_lock         - VRAM memory allocator
5. reset_domain->sem - GPU reset synchronization
6. reset_lock        - Reset control mutex
7. srbm_mutex        - SRBM register access
8. grbm_idx_mutex    - GRBM index register access
9. mmio_idx_lock     - MMIO index access (spinlock)

The implementation provides:
- Lock ordering training at module init (amdgpu_lockdep_init)
- Lock class association for real driver locks (amdgpu_lockdep_set_class)

Dummy locks are associated with the same class keys as real driver locks
via lockdep_set_class(), ensuring lockdep connects the training ordering
with actual runtime locks.
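
To make the hierarchy concrete, here is a userspace sketch that checks
acquisitions against fixed ranks. Note that this is a simpler technique than
what lockdep does (lockdep learns ordering from observed acquisitions per
lock class); the toy_* names and the rank enum are hypothetical, and only
the order of the nine levels is taken from the list above.

```c
#include <stdbool.h>

/* Ranks follow the hierarchy in the commit message, outermost
 * (lowest value) to innermost. */
enum toy_lock_rank {
	RANK_USERQ_SCH_MUTEX = 1,
	RANK_USERQ_MUTEX,
	RANK_NOTIFIER_LOCK,
	RANK_VRAM_LOCK,
	RANK_RESET_DOMAIN_SEM,
	RANK_RESET_LOCK,
	RANK_SRBM_MUTEX,
	RANK_GRBM_IDX_MUTEX,
	RANK_MMIO_IDX_LOCK,
};

#define TOY_MAX_HELD 16

/* Stack of ranks held by one thread; locks are assumed to be
 * released in LIFO order in this sketch. */
struct toy_held_locks {
	enum toy_lock_rank held[TOY_MAX_HELD];
	int depth;
};

/* Returns true if taking 'rank' respects the hierarchy, i.e. it is
 * strictly inner to the innermost lock currently held. */
static bool toy_acquire(struct toy_held_locks *h, enum toy_lock_rank rank)
{
	if (h->depth >= TOY_MAX_HELD)
		return false;
	if (h->depth > 0 && h->held[h->depth - 1] >= rank)
		return false;	/* ordering violation */
	h->held[h->depth++] = rank;
	return true;
}

/* Releases the innermost held lock, if any. */
static void toy_release(struct toy_held_locks *h)
{
	if (h->depth > 0)
		h->depth--;
}
```

Taking notifier_lock and then reset_domain->sem is accepted, while the
reverse order is rejected, which matches the v2 reordering described in this
commit message.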

Testing:
  Build the kernel with CONFIG_PROVE_LOCKING=y (enables CONFIG_LOCKDEP):
    scripts/config --enable PROVE_LOCKING
    scripts/config --enable DEBUG_LOCKDEP
    make -j$(nproc)

  On boot, dmesg should show:
    AMDGPU: Lockdep annotations initialized (9 lock levels)

  The companion IGT test (tests/amdgpu/amd_lockdep) exercises lock-heavy
  GPU code paths concurrently to trigger lockdep warnings on violations:
    sudo ./build/tests/amdgpu/amd_lockdep
    sudo dmesg | grep -A 50 "circular locking dependency"

  IGT subtests:
    concurrent-reset-and-submit  - reset_sem vs submission locks
    concurrent-mmap-and-evict    - mmap_lock vs vram_lock
    concurrent-userptr-and-reset - notifier_lock vs reset_sem
    stress-all-paths             - all of the above simultaneously

  A clean dmesg (no "circular locking dependency" or "possible recursive
  locking detected" messages) confirms no lock ordering violations.

  For CI integration, the test should be run on kernels compiled with
  CONFIG_LOCKDEP=y; dmesg is scanned post-run for lockdep splats.

v2: (Christian)
- Move notifier_lock and vram_lock before reset locks in hierarchy.
  HMM invalidation holds notifier_lock and can wait for GPU reset
  completion, so notifier_lock must be outer to reset_domain->sem.
- Associate dummy locks with lock class keys via lockdep_set_class()
  so lockdep connects training with real driver locks.
- Update commit message to list all 9 lock levels.

Requires CONFIG_PROVE_LOCKING=y to activate.

Cc: Christian Konig 
Cc: Alex Deucher 
Signed-off-by: Vitaly Prosyak 
Reviewed-by: Christian Konig 
Signed-off-by: Alex Deucher

drm/amdgpu: Fix user-triggerable BUG()/BUG_ON() calls

2026-06-04T19:24:13+00:00

Replace BUG()/BUG_ON() with error logs and safe returns in several
places where they can be triggered by invalid userspace input,
preventing DoS via kernel panic.
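
The general shape of such a conversion, as a userspace sketch: the
toy_request structure, TOY_NUM_RINGS and toy_select_ring() are hypothetical
and do not correspond to any specific call site touched by this change.
fprintf() stands in for a kernel error log.

```c
#include <errno.h>
#include <stdio.h>
#include <stddef.h>

#define TOY_NUM_RINGS 4	/* hypothetical limit */

/* Hypothetical request filled in from userspace input. */
struct toy_request {
	unsigned int ring_index;
};

/* Before: a check like BUG_ON(req->ring_index >= TOY_NUM_RINGS) would
 * take the whole machine down on bad input. After: log the problem and
 * return an error that the caller can pass back to userspace. */
static int toy_select_ring(const struct toy_request *req,
			   unsigned int *ring_out)
{
	if (req == NULL || ring_out == NULL)
		return -EINVAL;

	if (req->ring_index >= TOY_NUM_RINGS) {
		fprintf(stderr, "toy: invalid ring index %u\n",
			req->ring_index);
		return -EINVAL;
	}

	*ring_out = req->ring_index;
	return 0;
}
```

On invalid input the output parameter is left untouched, so the caller never
acts on a half-validated value.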

Signed-off-by: Ce Sun 
Acked-by: Alex Deucher 
Signed-off-by: Alex Deucher

drm/amdgpu: Adjust _PR3 detection

2026-06-03T17:55:45+00:00

_PR3 detection was changed in commit 134b8c5d8674 ("drm/amd: Fix
detection of _PR3 on the PCIe root port") to look at the root port
of the topology containing the GPU.  This, however, went too far because
it ignored whether or not all the intermediary bridges could power
off the device.  The original design in commit b10c1c5b3a4e ("drm/amdgpu:
add check for ACPI power resources") was too narrow because it matched
the switches internal to the GPU.

Use the goldilocks approach and look for the first bridge outside of the
GPU and check for _PR3 on that device.
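
A sketch of that lookup on a toy topology. The toy_pci_dev structure and its
'internal_to_gpu' and 'has_pr3' fields are assumptions made for
illustration; how the driver actually recognizes GPU-internal switch ports
and queries ACPI is not shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PCI-like node. */
struct toy_pci_dev {
	struct toy_pci_dev *upstream;	/* NULL at the root port */
	bool internal_to_gpu;		/* switch port inside the GPU package */
	bool has_pr3;			/* _PR3 power resource present */
};

/* Walk up from the GPU and return the first bridge that is not part
 * of the GPU, or NULL if there is none. */
static struct toy_pci_dev *toy_first_external_bridge(struct toy_pci_dev *gpu)
{
	struct toy_pci_dev *p;

	for (p = gpu->upstream; p != NULL; p = p->upstream) {
		if (!p->internal_to_gpu)
			return p;
	}
	return NULL;
}

/* _PR3 counts as available only if that specific bridge has it; the
 * root port and the GPU-internal switches are not consulted. */
static bool toy_gpu_has_pr3(struct toy_pci_dev *gpu)
{
	struct toy_pci_dev *bridge = toy_first_external_bridge(gpu);

	return bridge != NULL && bridge->has_pr3;
}
```

In a chain root port -> bridge -> internal switch -> GPU, a _PR3 on the root
port or on the internal switch does not count; only the bridge in the middle
decides the result.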

Fixes: 134b8c5d8674 ("drm/amd: Fix detection of _PR3 on the PCIe root port")
Reviewed-by: Alex Deucher 
Signed-off-by: Mario Limonciello 
Signed-off-by: Alex Deucher

drm/amd: Fix amdgpu_device_find_parent()

2026-06-03T17:47:12+00:00

commit eb53125a7ad9 ("drm/amd: Add dedicated helper for
amdgpu_device_find_parent()") created a dedicated helper to find
the parent device outside of the dGPU but it had a logic error
that caused it to walk all the way up the topology and return
the wrong device.

Break out of the loop when the device is found.
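
One plausible shape of such a bug, shown on a toy structure. This is an
assumption for illustration and not the code from the helper itself: toy_node
and both toy_find_parent_*() functions are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *parent;	/* NULL at the top of the topology */
	bool part_of_dgpu;		/* device belongs to the dGPU itself */
};

/* Buggy shape: the match is recorded but the loop keeps going, so
 * later iterations overwrite it with devices further up. */
static struct toy_node *toy_find_parent_buggy(struct toy_node *dev)
{
	struct toy_node *found = NULL;
	struct toy_node *p;

	for (p = dev->parent; p != NULL; p = p->parent) {
		if (p->part_of_dgpu)
			continue;
		found = p;	/* missing break */
	}
	return found;
}

/* Fixed shape: stop at the first device outside the dGPU. */
static struct toy_node *toy_find_parent_fixed(struct toy_node *dev)
{
	struct toy_node *found = NULL;
	struct toy_node *p;

	for (p = dev->parent; p != NULL; p = p->parent) {
		if (p->part_of_dgpu)
			continue;
		found = p;
		break;
	}
	return found;
}
```

For a chain root -> bridge -> switch -> GPU, the buggy variant ends up at the
root while the fixed variant returns the bridge.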

Reviewed-by: Alexander Deucher 
Fixes: eb53125a7ad9 ("drm/amd: Add dedicated helper for amdgpu_device_find_parent()")
Signed-off-by: Mario Limonciello 
Signed-off-by: Alex Deucher

drm/amd: Add dedicated helper for amdgpu_device_find_parent()

2026-05-27T14:48:36+00:00

There are a few cases where code walks up the topology to find the
link partner of the integrated switch in a dGPU.  Split this out
to a helper and call it in all places.

This does introduce a functional change: amdgpu_device_gpu_bandwidth()
no longer caches the internal link, only the parent.

Reviewed-by: Alex Deucher 
Signed-off-by: Mario Limonciello 
Signed-off-by: Alex Deucher

drm/amdgpu: unmap all user mappings of framebuffer and doorbell before mode1 reset

2026-05-19T15:45:41+00:00

During Mode 1 reset, the ASIC undergoes a reset cycle and becomes temporarily
inaccessible via PCIe. Any attempt to access framebuffer or MMIO registers during
this window can result in uncompleted PCIe transactions, leading to NMI panics or
system hangs.

To prevent this, unmap all of the applications' mappings of the framebuffer
and doorbell BARs before mode1 reset. Also prevent new mappings from coming in
during the reset process.

v2: remove inode in kfd_dev (Christian)
v3: correct unmap offset (Felix), remove prevent new mappings part
to avoid deadlock (Christian)

Reviewed-by: Felix Kuehling 
Signed-off-by: Yifan Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: rework userq reset work handling

2026-05-18T22:07:26+00:00

It is illegal to schedule reset work from another reset work!

Fix this by scheduling the userq reset work directly on the work queue
of the reset domain.

Not fully tested; I leave that to the IGT test cases.

Signed-off-by: Christian König 
Reviewed-by: Prike Liang 
Reviewed-by: Sunil Khatri 
Signed-off-by: Alex Deucher