<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c, branch linux-5.15.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>drm/amd/amdgpu embed hw_fence into amdgpu_job</title>
<updated>2021-08-16T19:16:58+00:00</updated>
<author>
<name>Jack Zhang</name>
<email>Jack.Zhang1@amd.com</email>
</author>
<published>2021-05-12T07:06:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c530b02f39850a639b72d01ebbf7e5d745c60831'/>
<id>c530b02f39850a639b72d01ebbf7e5d745c60831</id>
<content type='text'>
Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.

How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling

Signed-off-by: Jingwen Chen &lt;Jingwen.Chen2@amd.com&gt;
Signed-off-by: Jack Zhang &lt;Jack.Zhang7@hotmail.com&gt;
Reviewed-by: Andrey Grodzovsky &lt;andrey.grodzovsky@amd.com&gt;
Reviewed by: Monk Liu &lt;monk.liu@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.

How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling

Signed-off-by: Jingwen Chen &lt;Jingwen.Chen2@amd.com&gt;
Signed-off-by: Jack Zhang &lt;Jack.Zhang7@hotmail.com&gt;
Reviewed-by: Andrey Grodzovsky &lt;andrey.grodzovsky@amd.com&gt;
Reviewed by: Monk Liu &lt;monk.liu@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdgpu: Prevent any job recoveries after device is unplugged.</title>
<updated>2021-05-20T03:50:28+00:00</updated>
<author>
<name>Andrey Grodzovsky</name>
<email>andrey.grodzovsky@amd.com</email>
</author>
<published>2021-05-12T14:26:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ca4e17244bd213ed093927491ddb8eec0c21ada3'/>
<id>ca4e17244bd213ed093927491ddb8eec0c21ada3</id>
<content type='text'>
Return DRM_TASK_STATUS_ENODEV back to the scheduler when device
is not present so they timeout timer will not be rearmed.

v5: Update to match updated return values in enum drm_gpu_sched_stat

Signed-off-by: Andrey Grodzovsky &lt;andrey.grodzovsky@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-12-andrey.grodzovsky@amd.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Return DRM_TASK_STATUS_ENODEV back to the scheduler when device
is not present so they timeout timer will not be rearmed.

v5: Update to match updated return values in enum drm_gpu_sched_stat

Signed-off-by: Andrey Grodzovsky &lt;andrey.grodzovsky@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-12-andrey.grodzovsky@amd.com
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/scheduler: Job timeout handler returns status (v3)</title>
<updated>2021-01-29T10:30:22+00:00</updated>
<author>
<name>Luben Tuikov</name>
<email>luben.tuikov@amd.com</email>
</author>
<published>2021-01-20T20:09:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a6a1f036c74e3d2a3a56b3140492c7c3ecb879f3'/>
<id>a6a1f036c74e3d2a3a56b3140492c7c3ecb879f3</id>
<content type='text'>
This patch does not change current behaviour.

The driver's job timeout handler now returns
status indicating back to the DRM layer whether
the device (GPU) is no longer available, such as
after it's been unplugged, or whether all is
normal, i.e. current behaviour.

All drivers which make use of the
drm_sched_backend_ops' .timedout_job() callback
have been accordingly renamed and return the
would've-been default value of
DRM_GPU_SCHED_STAT_NOMINAL to restart the task's
timeout timer--this is the old behaviour, and is
preserved by this patch.

v2: Use enum as the status of a driver's job
    timeout callback method.

v3: Return scheduler/device information, rather
    than task information.

Cc: Alexander Deucher &lt;Alexander.Deucher@amd.com&gt;
Cc: Andrey Grodzovsky &lt;Andrey.Grodzovsky@amd.com&gt;
Cc: Christian König &lt;christian.koenig@amd.com&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Cc: Lucas Stach &lt;l.stach@pengutronix.de&gt;
Cc: Russell King &lt;linux+etnaviv@armlinux.org.uk&gt;
Cc: Christian Gmeiner &lt;christian.gmeiner@gmail.com&gt;
Cc: Qiang Yu &lt;yuq825@gmail.com&gt;
Cc: Rob Herring &lt;robh@kernel.org&gt;
Cc: Tomeu Vizoso &lt;tomeu.vizoso@collabora.com&gt;
Cc: Steven Price &lt;steven.price@arm.com&gt;
Cc: Alyssa Rosenzweig &lt;alyssa.rosenzweig@collabora.com&gt;
Cc: Eric Anholt &lt;eric@anholt.net&gt;
Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Luben Tuikov &lt;luben.tuikov@amd.com&gt;
Acked-by: Alyssa Rosenzweig &lt;alyssa.rosenzweig@collabora.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt;
Acked-by: Steven Price &lt;steven.price@arm.com&gt;
Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
Link: https://patchwork.freedesktop.org/patch/415095/
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch does not change current behaviour.

The driver's job timeout handler now returns
status indicating back to the DRM layer whether
the device (GPU) is no longer available, such as
after it's been unplugged, or whether all is
normal, i.e. current behaviour.

All drivers which make use of the
drm_sched_backend_ops' .timedout_job() callback
have been accordingly renamed and return the
would've-been default value of
DRM_GPU_SCHED_STAT_NOMINAL to restart the task's
timeout timer--this is the old behaviour, and is
preserved by this patch.

v2: Use enum as the status of a driver's job
    timeout callback method.

v3: Return scheduler/device information, rather
    than task information.

Cc: Alexander Deucher &lt;Alexander.Deucher@amd.com&gt;
Cc: Andrey Grodzovsky &lt;Andrey.Grodzovsky@amd.com&gt;
Cc: Christian König &lt;christian.koenig@amd.com&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Cc: Lucas Stach &lt;l.stach@pengutronix.de&gt;
Cc: Russell King &lt;linux+etnaviv@armlinux.org.uk&gt;
Cc: Christian Gmeiner &lt;christian.gmeiner@gmail.com&gt;
Cc: Qiang Yu &lt;yuq825@gmail.com&gt;
Cc: Rob Herring &lt;robh@kernel.org&gt;
Cc: Tomeu Vizoso &lt;tomeu.vizoso@collabora.com&gt;
Cc: Steven Price &lt;steven.price@arm.com&gt;
Cc: Alyssa Rosenzweig &lt;alyssa.rosenzweig@collabora.com&gt;
Cc: Eric Anholt &lt;eric@anholt.net&gt;
Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Luben Tuikov &lt;luben.tuikov@amd.com&gt;
Acked-by: Alyssa Rosenzweig &lt;alyssa.rosenzweig@collabora.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt;
Acked-by: Steven Price &lt;steven.price@arm.com&gt;
Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
Link: https://patchwork.freedesktop.org/patch/415095/
</pre>
</div>
</content>
</entry>
<entry>
<title>gpu/drm: ring_mirror_list --&gt; pending_list</title>
<updated>2020-12-08T13:38:03+00:00</updated>
<author>
<name>Luben Tuikov</name>
<email>luben.tuikov@amd.com</email>
</author>
<published>2020-12-04T03:17:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=6efa4b465cfdaa9be251ba043253e302ddb9976f'/>
<id>6efa4b465cfdaa9be251ba043253e302ddb9976f</id>
<content type='text'>
Rename "ring_mirror_list" to "pending_list",
to describe what something is, not what it does,
how it's used, or how the hardware implements it.

This also abstracts the actual hardware
implementation, i.e. how the low-level driver
communicates with the device it drives, ring, CAM,
etc., shouldn't be exposed to DRM.

The pending_list keeps jobs submitted, which are
out of our control. Usually this means they are
pending execution status in hardware, but the
latter definition is a more general (inclusive)
definition.

Signed-off-by: Luben Tuikov &lt;luben.tuikov@amd.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt;
Link: https://patchwork.freedesktop.org/patch/405573/

Cc: Alexander Deucher &lt;Alexander.Deucher@amd.com&gt;
Cc: Andrey Grodzovsky &lt;Andrey.Grodzovsky@amd.com&gt;
Cc: Christian König &lt;christian.koenig@amd.com&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Rename "ring_mirror_list" to "pending_list",
to describe what something is, not what it does,
how it's used, or how the hardware implements it.

This also abstracts the actual hardware
implementation, i.e. how the low-level driver
communicates with the device it drives, ring, CAM,
etc., shouldn't be exposed to DRM.

The pending_list keeps jobs submitted, which are
out of our control. Usually this means they are
pending execution status in hardware, but the
latter definition is a more general (inclusive)
definition.

Signed-off-by: Luben Tuikov &lt;luben.tuikov@amd.com&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt;
Link: https://patchwork.freedesktop.org/patch/405573/

Cc: Alexander Deucher &lt;Alexander.Deucher@amd.com&gt;
Cc: Andrey Grodzovsky &lt;Andrey.Grodzovsky@amd.com&gt;
Cc: Christian König &lt;christian.koenig@amd.com&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/scheduler: "node" --&gt; "list"</title>
<updated>2020-12-08T13:37:55+00:00</updated>
<author>
<name>Luben Tuikov</name>
<email>luben.tuikov@amd.com</email>
</author>
<published>2020-12-04T03:17:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8935ff00e3b110e47e3a291ff11951eb1066b1ae'/>
<id>8935ff00e3b110e47e3a291ff11951eb1066b1ae</id>
<content type='text'>
Rename "node" to "list" in struct drm_sched_job,
in order to make it consistent with what we see
being used throughout gpu_scheduler.h, for
instance in struct drm_sched_entity, as well as
the rest of DRM and the kernel.

Signed-off-by: Luben Tuikov &lt;luben.tuikov@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Link: https://patchwork.freedesktop.org/patch/403515/

Cc: Alexander Deucher &lt;Alexander.Deucher@amd.com&gt;
Cc: Andrey Grodzovsky &lt;Andrey.Grodzovsky@amd.com&gt;
Cc: Christian König &lt;christian.koenig@amd.com&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Rename "node" to "list" in struct drm_sched_job,
in order to make it consistent with what we see
being used throughout gpu_scheduler.h, for
instance in struct drm_sched_entity, as well as
the rest of DRM and the kernel.

Signed-off-by: Luben Tuikov &lt;luben.tuikov@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Link: https://patchwork.freedesktop.org/patch/403515/

Cc: Alexander Deucher &lt;Alexander.Deucher@amd.com&gt;
Cc: Andrey Grodzovsky &lt;Andrey.Grodzovsky@amd.com&gt;
Cc: Christian König &lt;christian.koenig@amd.com&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/scheduler: Scheduler priority fixes (v2)</title>
<updated>2020-08-18T22:20:17+00:00</updated>
<author>
<name>Luben Tuikov</name>
<email>luben.tuikov@amd.com</email>
</author>
<published>2020-08-11T23:59:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e2d732fdb7a9e421720a644580cd6a9400f97f60'/>
<id>e2d732fdb7a9e421720a644580cd6a9400f97f60</id>
<content type='text'>
Remove DRM_SCHED_PRIORITY_LOW, as it was used
in only one place.

Rename and separate by a line
DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT
as it represents a (total) count of said
priorities and it is used as such in loops
throughout the code. (0-based indexing is the
the count number.)

Remove redundant word HIGH in priority names,
and rename *KERNEL* to *HIGH*, as it really
means that, high.

v2: Add back KERNEL and remove SW and HW,
    in lieu of a single HIGH between NORMAL and KERNEL.

Signed-off-by: Luben Tuikov &lt;luben.tuikov@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Remove DRM_SCHED_PRIORITY_LOW, as it was used
in only one place.

Rename and separate by a line
DRM_SCHED_PRIORITY_MAX to DRM_SCHED_PRIORITY_COUNT
as it represents a (total) count of said
priorities and it is used as such in loops
throughout the code. (0-based indexing is the
the count number.)

Remove redundant word HIGH in priority names,
and rename *KERNEL* to *HIGH*, as it really
means that, high.

v2: Add back KERNEL and remove SW and HW,
    in lieu of a single HIGH between NORMAL and KERNEL.

Signed-off-by: Luben Tuikov &lt;luben.tuikov@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdgpu: revert "fix system hang issue during GPU reset"</title>
<updated>2020-08-14T20:22:40+00:00</updated>
<author>
<name>Christian König</name>
<email>christian.koenig@amd.com</email>
</author>
<published>2020-08-12T15:48:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f1403342ebdfcff3c3cf57ae476f19d3078f2767'/>
<id>f1403342ebdfcff3c3cf57ae476f19d3078f2767</id>
<content type='text'>
The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems like this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929.

Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
Acked-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems like this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929.

Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
Acked-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdgpu: fix system hang issue during GPU reset</title>
<updated>2020-07-27T20:21:37+00:00</updated>
<author>
<name>Dennis Li</name>
<email>Dennis.Li@amd.com</email>
</author>
<published>2020-07-08T07:07:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=df9c8d1aa278c435c30a69b8f2418b4a52fcb929'/>
<id>df9c8d1aa278c435c30a69b8f2418b4a52fcb929</id>
<content type='text'>
when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev-&gt;in_gpu_reset and hive-&gt;in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev-&gt;reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm-&gt;is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev-&gt;in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev-&gt;reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive-&gt;in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.

Reviewed-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Suggested-by: Andrey Grodzovsky &lt;andrey.grodzovsky@amd.com&gt;
Suggested-by: Christian König &lt;christian.koenig@amd.com&gt;
Suggested-by: Felix Kuehling &lt;Felix.Kuehling@amd.com&gt;
Suggested-by: Lijo Lazar &lt;Lijo.Lazar@amd.com&gt;
Suggested-by: Luben Tukov &lt;luben.tuikov@amd.com&gt;
Signed-off-by: Dennis Li &lt;Dennis.Li@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev-&gt;in_gpu_reset and hive-&gt;in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev-&gt;reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm-&gt;is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev-&gt;in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev-&gt;reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive-&gt;in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.

Reviewed-by: Hawking Zhang &lt;Hawking.Zhang@amd.com&gt;
Suggested-by: Andrey Grodzovsky &lt;andrey.grodzovsky@amd.com&gt;
Suggested-by: Christian König &lt;christian.koenig@amd.com&gt;
Suggested-by: Felix Kuehling &lt;Felix.Kuehling@amd.com&gt;
Suggested-by: Lijo Lazar &lt;Lijo.Lazar@amd.com&gt;
Suggested-by: Luben Tukov &lt;luben.tuikov@amd.com&gt;
Signed-off-by: Dennis Li &lt;Dennis.Li@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdgpu: don't do soft recovery if gpu_recovery=0</title>
<updated>2020-07-10T21:40:39+00:00</updated>
<author>
<name>Marek Olšák</name>
<email>marek.olsak@amd.com</email>
</author>
<published>2020-07-06T22:23:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=cc063ea2ec7cc091639e6c95eb93e97d6e2ed6e3'/>
<id>cc063ea2ec7cc091639e6c95eb93e97d6e2ed6e3</id>
<content type='text'>
It's impossible to debug shader hangs with soft recovery.

Signed-off-by: Marek Olšák &lt;marek.olsak@amd.com&gt;
Reviewed-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It's impossible to debug shader hangs with soft recovery.

Signed-off-by: Marek Olšák &lt;marek.olsak@amd.com&gt;
Reviewed-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Reviewed-by: Christian König &lt;christian.koenig@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/amdgpu: remove distinction between explicit and implicit sync (v2)</title>
<updated>2020-07-01T05:59:22+00:00</updated>
<author>
<name>Christian König</name>
<email>christian.koenig@amd.com</email>
</author>
<published>2020-05-27T08:31:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=174b328bc89b5fa9cfec57831670e8100a1e0da0'/>
<id>174b328bc89b5fa9cfec57831670e8100a1e0da0</id>
<content type='text'>
According to Marek a pipeline sync should be inserted for implicit syncs well.

v2: bump the driver version

Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
Tested-by: Marek Olšák &lt;marek.olsak@amd.com&gt;
Signed-off-by: Marek Olšák &lt;marek.olsak@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
According to Marek a pipeline sync should be inserted for implicit syncs well.

v2: bump the driver version

Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;
Tested-by: Marek Olšák &lt;marek.olsak@amd.com&gt;
Signed-off-by: Marek Olšák &lt;marek.olsak@amd.com&gt;
Signed-off-by: Alex Deucher &lt;alexander.deucher@amd.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
