linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c, branch linux-6.10.y

drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic

2024-09-12T09:13:05+00:00

[ Upstream commit 6e4aa08fa9c6c0c027fc86f242517c925d159393 ]

The retry loop for SRIOV reset has refcount and memory leak issues.
Depending on which function call fails, it can call
amdgpu_amdkfd_pre/post_reset a different number of times, which leaves
the kfd_locked count wrong. This blocks all future attempts at
opening /dev/kfd. The retry loop also leaks resources by calling
amdgpu_virt_init_data_exchange multiple times without calling the
corresponding fini function.

Align with the bare-metal reset path, which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
reset loop and calling amdgpu_device_pre_asic_reset on each retry, which
properly frees the resources from the previous try by calling
amdgpu_virt_fini_data_exchange.

Signed-off-by: Yunxiang Li 
Reviewed-by: Emily Deng 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin
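
The shape of the retry-loop fix described above can be illustrated with a small user-space sketch. It is not the driver code: `struct fake_dev`, its counters, and every function name below are hypothetical stand-ins that only model the bookkeeping. The point it shows is that the pre/post pair runs exactly once around the loop, while per-attempt cleanup runs at the top of every attempt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for driver state; not the real amdgpu structures. */
struct fake_dev {
	int kfd_locked;     /* must be back at 0 once the reset path returns */
	int exchange_refs;  /* data-exchange inits that were never released */
	int fail_attempts;  /* how many attempts should fail before success */
};

static void kfd_pre_reset(struct fake_dev *d)  { d->kfd_locked++; }
static void kfd_post_reset(struct fake_dev *d) { d->kfd_locked--; }

/* Per-attempt cleanup: release whatever the previous attempt set up. */
static void pre_asic_reset(struct fake_dev *d)
{
	if (d->exchange_refs > 0)
		d->exchange_refs--;
}

/* One reset attempt: sets up the data exchange, then may fail. */
static bool try_reset(struct fake_dev *d)
{
	d->exchange_refs++;
	if (d->fail_attempts > 0) {
		d->fail_attempts--;
		return false;
	}
	return true;
}

static bool reset_with_retry(struct fake_dev *d, int max_retries)
{
	bool ok = false;

	kfd_pre_reset(d);               /* once, outside the loop */
	for (int i = 0; i <= max_retries && !ok; i++) {
		pre_asic_reset(d);      /* every attempt starts clean */
		ok = try_reset(d);
	}
	kfd_post_reset(d);              /* once, outside the loop */

	assert(d->kfd_locked == 0);
	return ok;
}
```

In this model the lock count is balanced and at most one data-exchange setup is outstanding, no matter how many attempts fail.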

drm/amdgpu: Add reset_context flag for host FLR

2024-09-12T09:13:05+00:00

[ Upstream commit 25c01191c2555351922e5515b6b6d31357975031 ]

There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.

Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.

Signed-off-by: Yunxiang Li 
Reviewed-by: Emily Deng 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher 
Stable-dep-of: 6e4aa08fa9c6 ("drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic")
Signed-off-by: Sasha Levin
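
A minimal sketch of why an explicit flag is more reliable than inferring the reset source from a NULL job pointer. The struct, the flag name `SKETCH_HOST_FLR`, and both helpers are assumptions made for illustration and do not reproduce the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit; the name is illustrative only. */
enum sketch_reset_flag {
	SKETCH_HOST_FLR = 1 << 0,
};

/* Hypothetical reduced reset context. */
struct sketch_reset_context {
	void *job;           /* NULL for several unrelated reset sources */
	unsigned int flags;  /* explicit markers set by the reset source */
};

/* Old heuristic: treats every job-less reset as a host FLR. */
static bool is_host_flr_by_job(const struct sketch_reset_context *ctx)
{
	return ctx->job == NULL;
}

/* Explicit check: true only when the source marked the context. */
static bool is_host_flr(const struct sketch_reset_context *ctx)
{
	return (ctx->flags & SKETCH_HOST_FLR) != 0;
}
```

A job-less reset from some other source would be misclassified by the first helper and classified correctly by the second.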

drm/amdgpu: Fix two reset triggered in a row

2024-09-12T09:13:05+00:00

[ Upstream commit f4322b9f8ad5f9f62add288c785d2e10bb6a5efe ]

Sometimes a hung GPU causes multiple reset sources to schedule resets.
The second source will be able to trigger an unnecessary reset if it
schedules after we call amdgpu_device_stop_pending_resets.

Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.

Remove unnecessary and incorrect checks for amdgpu_in_reset that were
loosely serving this purpose.

Signed-off-by: Yunxiang Li 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher 
Stable-dep-of: 6e4aa08fa9c6 ("drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic")
Signed-off-by: Sasha Levin
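
The ordering argument above can be modelled with a toy counter. Everything here is hypothetical (`struct sketch_gpu`, `late_sources`, the function names); it only shows that cancelling pending requests after the reset also discards requests that arrived while the reset was running.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: pending reset requests and completed resets. */
struct sketch_gpu {
	int pending;
	int resets_done;
};

static void sketch_stop_pending(struct sketch_gpu *g)
{
	g->pending = 0;
}

/* late_sources: requests queued by other sources during the reset. */
static void sketch_do_reset(struct sketch_gpu *g, int late_sources)
{
	g->pending += late_sources;
	g->resets_done++;
}

static void sketch_recover(struct sketch_gpu *g, bool stop_after,
			   int late_sources)
{
	if (!stop_after)
		sketch_stop_pending(g);   /* old order: cancel first */
	sketch_do_reset(g, late_sources);
	if (stop_after)
		sketch_stop_pending(g);   /* new order: cancel last */
}
```

With the old order a late request survives and would cause a second, unnecessary reset; with the new order nothing is left pending.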

drm/amdgpu: fix dereference after null check

2024-09-08T05:56:29+00:00

[ Upstream commit b1f7810b05d1950350ac2e06992982974343e441 ]

Check the hive pointer before use.

Signed-off-by: Jesse Zhang 
Reviewed-by: Tim Huang 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amd/amdgpu: Check tbo resource pointer

2024-09-08T05:56:25+00:00

[ Upstream commit 6cd2b872643bb29bba01a8ac739138db7bd79007 ]

Validate the tbo resource pointer and skip if it is NULL.

Signed-off-by: Asad Kamal 
Reviewed-by: Christian König 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin

drm/amdgpu: Add lock around VF RLCG interface

2024-08-14T13:34:14+00:00

[ Upstream commit e864180ee49b4d30e640fd1e1d852b86411420c9 ]

flush_gpu_tlb may be called from another thread while
device_gpu_recover is running.

Both of these threads access registers through the VF
RLCG interface during VF Full Access. Add a lock around this interface
to prevent race conditions between these threads.

Signed-off-by: Victor Skvortsov 
Reviewed-by: Zhigang Luo 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin
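
The race described above arises because a register access through an indirect interface is a multi-step sequence on shared state. The sketch below is a user-space analogy using a C11 `atomic_flag` spinlock; the struct layout, the 16-entry register file, and all names are assumptions for illustration, not the driver's RLCG implementation or its choice of lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_NUM_REGS 16

/* Hypothetical indirect register interface shared by several threads. */
struct sketch_rlcg {
	atomic_flag lock;
	uint32_t offset_reg;              /* step 1: select the target */
	uint32_t data_reg;                /* step 2: stage the value */
	uint32_t regs[SKETCH_NUM_REGS];   /* backing register file */
};

static void sketch_lock(struct sketch_rlcg *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock,
						 memory_order_acquire))
		; /* spin until the other access sequence finishes */
}

static void sketch_unlock(struct sketch_rlcg *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* The whole select/stage/commit sequence runs under the lock. */
static void sketch_rlcg_write(struct sketch_rlcg *r, uint32_t offset,
			      uint32_t value)
{
	assert(offset < SKETCH_NUM_REGS);
	sketch_lock(r);
	r->offset_reg = offset;
	r->data_reg = value;
	r->regs[r->offset_reg] = r->data_reg;
	sketch_unlock(r);
}

static uint32_t sketch_rlcg_read(struct sketch_rlcg *r, uint32_t offset)
{
	uint32_t value;

	assert(offset < SKETCH_NUM_REGS);
	sketch_lock(r);
	r->offset_reg = offset;
	value = r->regs[r->offset_reg];
	sketch_unlock(r);
	return value;
}
```

Without the lock, a second thread could overwrite `offset_reg` between the select and commit steps, so the value would land in the wrong register.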

drm/amdgpu: Check if NBIO funcs are NULL in amdgpu_device_baco_exit

2024-08-03T06:59:50+00:00

[ Upstream commit 0cdb3f9740844b9d95ca413e3fcff11f81223ecf ]

The special case for VM passthrough doesn't check adev->nbio.funcs
before dereferencing it. If GPUs that don't have an NBIO block are
passed through, this leads to a NULL pointer dereference on startup.

Signed-off-by: Friedrich Vock 
Fixes: 1bece222eabe ("drm/amdgpu: Clear doorbell interrupt status for Sienna Cichlid")
Cc: Alex Deucher 
Cc: Christian König 
Acked-by: Christian König 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin
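
A minimal sketch of the guard this fix describes: check the function table, and the callback inside it, before calling through them. The struct names, the `passthrough` field, and the callback signature are hypothetical and chosen only to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback table for an optional hardware block. */
struct sketch_nbio_funcs {
	void (*clear_doorbell_interrupt)(int *status);
};

struct sketch_dev {
	const struct sketch_nbio_funcs *nbio_funcs; /* NULL if block absent */
	int doorbell_status;
	bool passthrough;
};

static void sketch_clear_doorbell(int *status)
{
	*status = 0;
}

/* Returns 0; only touches the callback when the whole chain is valid. */
static int sketch_baco_exit(struct sketch_dev *d)
{
	if (d->passthrough && d->nbio_funcs &&
	    d->nbio_funcs->clear_doorbell_interrupt)
		d->nbio_funcs->clear_doorbell_interrupt(&d->doorbell_status);
	return 0;
}
```

A device without the block simply skips the special case instead of dereferencing a NULL table.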

drm/amdgpu: Fix pci state save during mode-1 reset

2024-06-25T18:13:12+00:00

Cache the PCI state before bus master is disabled. The saved state is
later used for other cases like restoring config space after mode-2
reset.

Fixes: 5c03e5843e6b ("drm/amdgpu:add smu mode1/2 support for aldebaran")
Signed-off-by: Lijo Lazar 
Reviewed-by: Feifei Xu 
Reviewed-by: Hawking Zhang 
Signed-off-by: Alex Deucher

drm/amdgpu: Adjust logic in amdgpu_device_partner_bandwidth()

2024-05-29T21:01:49+00:00

Use current speed/width on devices which don't support
dynamic PCIe switching.

Fixes: 466a7d115326 ("drm/amd: Use the first non-dGPU PCI device for BW limits")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3289
Acked-by: Christian König 
Signed-off-by: Alex Deucher
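
The selection rule stated above can be sketched as a single branch. The structs, field names, and the plain integer encoding of speed and width are assumptions for illustration; the real function queries the PCI core and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical link description: generation number and lane count. */
struct sketch_link {
	int speed;
	int width;
};

struct sketch_partner {
	bool dynamic_switching;   /* can the link change speed at runtime? */
	struct sketch_link cap;   /* maximum advertised capability */
	struct sketch_link cur;   /* currently negotiated link */
};

/*
 * Without dynamic switching the link stays as negotiated, so the
 * current values are the usable limit; otherwise use the capability.
 */
static struct sketch_link
sketch_partner_bandwidth(const struct sketch_partner *p)
{
	return p->dynamic_switching ? p->cap : p->cur;
}
```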

drm/amdgpu: skip ip dump if devcoredump flag is set

2024-04-26T21:22:44+00:00

Do not dump the IP registers during driver reload
in a passthrough environment.

Signed-off-by: Sunil Khatri 
Reviewed-by: Alex Deucher 
Reviewed-by: Lijo Lazar 
Signed-off-by: Alex Deucher