linux.git/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h, branch v5.8

drm/amdkfd: Enable over-subscription with >1 GWS queue

2020-04-28T20:20:30+00:00

The current GWS usage model only allows a single GWS-enabled
process to be active on the GPU at once. This ensures that a
barrier-using kernel gets a known amount of GPU hardware, to
prevent deadlock due to inability to go beyond the GWS barrier.

The HWS watches how many GWS entries are assigned to each process,
and goes into over-subscription mode when two processes need more
than the 64 that are available. The current KFD method for working
with this is to allocate all 64 GWS entries to each GWS-capable
process.

When more than one GWS-enabled process is in the runlist, we must
make sure the runlist is in over-subscription mode, so that the
HWS gets a chained RUN_LIST packet and continues scheduling
kernels.
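
A minimal sketch of the decision described above, not the actual KFD
code: the struct, the function name, and the two limit parameters are
hypothetical, and the check counts GWS-enabled processes as the text
does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one process on the runlist. */
struct sketch_process {
	unsigned int compute_queues; /* active CP queues owned by the process */
	bool uses_gws;               /* true if the process has a GWS-enabled queue */
};

/*
 * Return true when the runlist should be built in over-subscription
 * mode, i.e. terminated with a chained RUN_LIST packet.
 * max_procs and max_cp_queues are assumed hardware limits.
 */
static bool sketch_needs_over_subscription(const struct sketch_process *procs,
					   size_t nprocs,
					   unsigned int max_procs,
					   unsigned int max_cp_queues)
{
	unsigned int total_queues = 0;
	unsigned int gws_procs = 0;
	size_t i;

	for (i = 0; i < nprocs; i++) {
		total_queues += procs[i].compute_queues;
		if (procs[i].uses_gws)
			gws_procs++;
	}

	return nprocs > max_procs ||
	       total_queues > max_cp_queues ||
	       gws_procs > 1; /* more than one GWS user forces over-subscription */
}
```

With two GWS-enabled processes the sketch reports over-subscription
even when the process and queue limits are not exceeded.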

Signed-off-by: Joseph Greathouse 
Reviewed-by: Felix Kuehling 
Signed-off-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Delete unnecessary unmap queue package submissions

2020-02-26T19:20:33+00:00

The previous way of using the SDMA queue count to infer whether we should
unmap SDMA engines has bugs. It did not cause issues because the MEC
firmware unmaps all queues (CP + SDMA) when an unmap package for the compute
engine is received. Because of that, only one unmap queue package
is needed, instead of one unmap queue package for CP and each SDMA engine,
which results in much simpler driver code.

Signed-off-by: Yong Zhao 
Acked-by: Alex Deucher 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Count active CP queues directly

2020-02-26T19:20:13+00:00

The previous code for calculating active CP queues is problematic if
some SDMA queues are inactive. Fix that by counting CP queues directly.
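
A minimal sketch of direct counting, assuming a separate counter for
active CP queues that is updated only when a compute queue becomes
active or inactive. All type and function names here are hypothetical
stand-ins, not the driver's definitions.

```c
#include <stdbool.h>

enum sketch_queue_type {
	SKETCH_QUEUE_TYPE_COMPUTE,
	SKETCH_QUEUE_TYPE_SDMA,
};

struct sketch_queue {
	enum sketch_queue_type type;
	bool is_active;
};

struct sketch_dqm {
	unsigned int active_queue_count;    /* all active queues */
	unsigned int active_cp_queue_count; /* active CP queues only */
};

/* Account for a queue that was just added; inactive queues are ignored. */
static void sketch_increment_queue_count(struct sketch_dqm *dqm,
					 const struct sketch_queue *q)
{
	if (!q->is_active)
		return;
	dqm->active_queue_count++;
	if (q->type == SKETCH_QUEUE_TYPE_COMPUTE)
		dqm->active_cp_queue_count++;
}

/* Undo the accounting when an active queue is removed. */
static void sketch_decrement_queue_count(struct sketch_dqm *dqm,
					 const struct sketch_queue *q)
{
	if (!q->is_active)
		return;
	dqm->active_queue_count--;
	if (q->type == SKETCH_QUEUE_TYPE_COMPUTE)
		dqm->active_cp_queue_count--;
}
```

Because the CP counter is never derived from the SDMA counters, an
inactive SDMA queue cannot skew it.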

Signed-off-by: Yong Zhao 
Acked-by: Alex Deucher 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Avoid ambiguity by indicating it's cp queue

2020-02-26T19:20:05+00:00

The queues represented in queue_bitmap are only CP queues.

Signed-off-by: Yong Zhao 
Acked-by: Alex Deucher 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Rename queue_count to active_queue_count

2020-02-26T19:19:38+00:00

The new name makes the code easier to understand.

Signed-off-by: Yong Zhao 
Acked-by: Alex Deucher 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Improve HWS hang detection and handling

2020-01-07T16:54:56+00:00

Move HWS hang detection into unmap_queues_cpsch to catch hangs in all
cases. If this happens during a reset, don't schedule another reset
because the reset already in progress is expected to take care of it.
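
A minimal sketch of that control flow, under the assumption that a
hang shows up as a timed-out fence wait inside the unmap path. The
struct, its fields, and the -ETIMEDOUT return value are placeholders
chosen for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state relevant to hang handling. */
struct sketch_dev {
	bool is_hws_hang;             /* set once a hang has been detected */
	unsigned int resets_scheduled; /* stands in for queuing reset work */
};

/*
 * Model of the unmap path: every caller goes through here, so a hang
 * is caught in all cases. If a reset is already in progress, no second
 * reset is scheduled because the running one is expected to recover.
 */
static int sketch_unmap_queues(struct sketch_dev *dev,
			       bool fence_timed_out,
			       bool reset_in_progress)
{
	if (!fence_timed_out)
		return 0;

	dev->is_hws_hang = true;
	if (!reset_in_progress)
		dev->resets_scheduled++;

	return -ETIMEDOUT;
}
```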

Signed-off-by: Felix Kuehling 
Tested-by: Emily Deng 
Reviewed-by: shaoyunl  
Signed-off-by: Alex Deucher

drm/amdkfd: Remove unused variable

2020-01-07T16:54:50+00:00

dqm->pipeline_mem wasn't used anywhere.

Signed-off-by: Felix Kuehling 
Reviewed-by: shaoyunl  
Signed-off-by: Alex Deucher

drm/amdkfd: don't use dqm lock during device reset/suspend/resume

2019-10-25T20:50:10+00:00

If device reset/suspend/resume fails for some reason, the dqm lock is
held forever and this causes a deadlock. Below is a kernel backtrace from
an application opening kfd after suspend/resume failed.

Instead of holding dqm lock in pre_reset and releasing dqm lock in
post_reset, add dqm->sched_running flag which is modified in
dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
because write/read are all inside dqm lock.

For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
sched_running flag before sending the updated runlist.

v2: For the no-HWS case, when the device is stopped, don't call
load/destroy_mqd for eviction, restore and create queue, and avoid
the debugfs dump of hqds.
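
A minimal sketch of the flag-based approach for the HWS case. The
struct and function names are hypothetical, and the dqm lock that the
real callers would hold around every read and write of the flag is
left out to keep the example self-contained.

```c
#include <stdbool.h>

/* Hypothetical queue manager state. */
struct sketch_dqm {
	bool sched_running;         /* set by start, cleared by stop */
	unsigned int runlists_sent; /* stands in for submitting a runlist */
};

static void sketch_start(struct sketch_dqm *dqm)
{
	dqm->sched_running = true;
}

static void sketch_stop(struct sketch_dqm *dqm)
{
	dqm->sched_running = false;
}

/*
 * Model of the map path: while the scheduler is stopped (reset,
 * suspend), skip the hardware submission instead of blocking on a
 * lock held across the whole reset.
 */
static int sketch_map_queues(struct sketch_dqm *dqm)
{
	if (!dqm->sched_running)
		return 0;

	dqm->runlists_sent++;
	return 0;
}
```

If a reset fails after sketch_stop(), later callers still return
promptly because nothing is left holding a lock.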

Backtrace of dqm lock deadlock:

[Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds.
[Thu Oct 17 16:43:37 2019]       Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
[Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947 0x80000000
[Thu Oct 17 16:43:37 2019] Call Trace:
[Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
[Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
[Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
[Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
[Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
[Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0 [amdgpu]
[Thu Oct 17 16:43:37 2019]  kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
[Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220 [amdgpu]
[Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
[Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
[Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
[Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
[Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
[Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
[Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
[Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
[Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
[Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
[Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Suggested-by: Felix Kuehling 
Signed-off-by: Philip Yang 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Record vmid pasid mapping in the driver for non HWS mode

2019-10-03T14:11:03+00:00

This makes it possible to query the vmid-to-pasid mapping through software.
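
A minimal sketch of such a software record, assuming a small table
indexed by vmid that the driver updates whenever it programs a
mapping. The table size, the 16-bit pasid width, the use of 0 as
"unused", and all names are assumptions made for illustration.

```c
#include <stdint.h>

#define SKETCH_VMID_NUM 16 /* hypothetical number of vmids */

struct sketch_dqm {
	uint16_t vmid_pasid[SKETCH_VMID_NUM]; /* 0 means the vmid is unused */
};

/* Record the mapping at the point where the hardware is programmed. */
static void sketch_set_pasid_vmid_mapping(struct sketch_dqm *dqm,
					  unsigned int vmid, uint16_t pasid)
{
	if (vmid < SKETCH_VMID_NUM)
		dqm->vmid_pasid[vmid] = pasid;
}

/* Query the recorded mapping; returns 0 on success, -1 if none exists. */
static int sketch_vmid_to_pasid(const struct sketch_dqm *dqm,
				unsigned int vmid, uint16_t *pasid)
{
	if (vmid >= SKETCH_VMID_NUM || dqm->vmid_pasid[vmid] == 0)
		return -1;

	*pasid = dqm->vmid_pasid[vmid];
	return 0;
}
```

The query reads only driver memory, so it does not depend on reading
the mapping back from hardware registers.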

Signed-off-by: Yong Zhao 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Add navi10 support to amdkfd. (v3)

2019-06-21T23:59:24+00:00

KFD (kernel fusion driver) is the kernel driver
for the compute backend of the usermode compute
stack.

v2: squash in updates (Alex)
v3: squash in rebase fixes (Alex)

Signed-off-by: Oak Zeng 
Signed-off-by: Philip Cox 
Acked-by: Alex Deucher 
Signed-off-by: Hawking Zhang 
Signed-off-by: Alex Deucher