linux-stable.git/drivers/gpu/drm/scheduler, branch linux-5.15.y

drm/sched: Fix race in drm_sched_entity_select_rq()

2025-12-06T21:09:15+00:00

[ Upstream commit d25e3a610bae03bffc5c14b5d944a5d0cd844678 ]

In a past bug fix it was forgotten that entity access must be protected
by the entity lock. That's a data race and potentially UB.

Move the spin_unlock() to the appropriate position.

Cc: stable@vger.kernel.org # v5.13+
Fixes: ac4eb83ab255 ("drm/sched: select new rq even if there is only one v3")
Reviewed-by: Tvrtko Ursulin 
Signed-off-by: Philipp Stanner 
Link: https://patch.msgid.link/20251022063402.87318-2-phasta@kernel.org
[ adapted lock field name from entity->lock to entity->rq_lock ]
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

drm/sched: Remove optimization that causes hang when killing dependent jobs

2025-08-28T14:24:30+00:00

[ Upstream commit 15f77764e90a713ee3916ca424757688e4f565b9 ]

When application A submits jobs and application B submits a job with a
dependency on A's fence, the normal flow wakes up the scheduler after
processing each job. However, the optimization in
drm_sched_entity_add_dependency_cb() uses a callback that only clears
dependencies without waking up the scheduler.

When application A is killed before its jobs can run, the callback gets
triggered but only clears the dependency without waking up the scheduler,
causing the scheduler to enter sleep state and application B to hang.

Remove the optimization by deleting drm_sched_entity_clear_dep() and its
usage, ensuring the scheduler is always woken up when dependencies are
cleared.

Fixes: 777dbd458c89 ("drm/amdgpu: drop a dummy wakeup scheduler")
Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: Lin.Cao 
Reviewed-by: Christian König 
Signed-off-by: Philipp Stanner 
Link: https://lore.kernel.org/r/20250717084453.921097-1-lincao12@amd.com
[ Changed drm_sched_wakeup_if_can_queue() calls to drm_sched_wakeup() ]
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

drm/sched: Fix preprocessor guard

2025-03-13T11:51:05+00:00

[ Upstream commit 23e0832d6d7be2d3c713f9390c060b6f1c48bf36 ]

When writing the header guard for gpu_scheduler_trace.h, a typo,
apparently, occurred.

Fix the typo and document the scope of the guard.

Fixes: 353da3c520b4 ("drm/amdgpu: add tracepoint for scheduler (v2)")
Reviewed-by: Tvrtko Ursulin 
Signed-off-by: Philipp Stanner 
Link: https://patchwork.freedesktop.org/patch/msgid/20250218124149.118002-2-phasta@kernel.org
Signed-off-by: Sasha Levin

drm/sched: Add locking to drm_sched_entity_modify_sched

2024-10-17T13:11:42+00:00

commit 4286cc2c953983d44d248c9de1c81d3a9643345c upstream.

Without the locking amdgpu currently can race between
amdgpu_ctx_set_entity_priority() (via drm_sched_entity_modify_sched()) and
drm_sched_job_arm(), leading to the latter accesing potentially
inconsitent entity->sched_list and entity->num_sched_list pair.

v2:
 * Improve commit message. (Philipp)

Signed-off-by: Tvrtko Ursulin 
Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify sched list")
Cc: Christian König 
Cc: Alex Deucher 
Cc: Luben Tuikov 
Cc: Matthew Brost 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: dri-devel@lists.freedesktop.org
Cc: Philipp Stanner 
Cc:  # v5.7+
Reviewed-by: Christian König 
Link: https://patchwork.freedesktop.org/patch/msgid/20240913160559.49054-2-tursulin@igalia.com
Signed-off-by: Christian König 
Signed-off-by: Greg Kroah-Hartman

dma-buf: add dma_fence_timestamp helper

2024-02-23T07:55:10+00:00

[ Upstream commit b83ce9cb4a465b8f9a3fa45561b721a9551f60e3 ]

When a fence signals there is a very small race window where the timestamp
isn't updated yet. sync_file solves this by busy waiting for the
timestamp to appear, but on other ocassions didn't handled this
correctly.

Provide a dma_fence_timestamp() helper function for this and use it in
all appropriate cases.

Another alternative would be to grab the spinlock when that happens.

v2 by teddy: add a wait parameter to wait for the timestamp to show up, in case
   the accurate timestamp is needed and/or the timestamp is not based on
   ktime (e.g. hw timestamp)
v3 chk: drop the parameter again for unified handling

Signed-off-by: Yunxiang Li 
Signed-off-by: Christian König 
Fixes: 1774baa64f93 ("drm/scheduler: Change scheduled fence track v2")
Reviewed-by: Alex Deucher 
CC: stable@vger.kernel.org
Link: https://patchwork.freedesktop.org/patch/msgid/20230929104725.2358-1-christian.koenig@amd.com
Signed-off-by: Sasha Levin

drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

2021-07-01T06:53:25+00:00

Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize timeout
works with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.

v5:
* Add a new paragraph to the timedout_job() method

v3:
* New patch

v4:
* Actually use the timeout_wq to queue the timeout work

Suggested-by: Daniel Vetter 
Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Lucas Stach 
Acked-by: Daniel Vetter 
Acked-by: Christian König 
Cc: Qiang Yu 
Cc: Emma Anholt 
Cc: Alex Deucher 
Cc: "Christian König" 
Link: https://patchwork.freedesktop.org/patch/msgid/20210630062751.2832545-3-boris.brezillon@collabora.com

drm/sched: Declare entity idle only after HW submission

2021-06-28T11:16:49+00:00

The panfrost driver tries to kill in-flight jobs on FD close after
destroying the FD scheduler entities. For this to work properly, we
need to make sure the jobs popped from the scheduler entities have
been queued at the HW level before declaring the entity idle, otherwise
we might iterate over a list that doesn't contain those jobs.

Suggested-by: Lucas Stach 
Signed-off-by: Boris Brezillon 
Cc: Lucas Stach 
Reviewed-by: Steven Price 
Reviewed-by: Lucas Stach 
Link: https://patchwork.freedesktop.org/patch/msgid/20210624140850.2229697-1-boris.brezillon@collabora.com

drm/sched: Avoid data corruptions

2021-05-20T03:50:28+00:00

Wait for all dependencies of a job  to complete before
killing it to avoid data corruptions.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Christian König 
Link: https://patchwork.freedesktop.org/patch/msgid/20210519141407.88444-1-andrey.grodzovsky@amd.com

drm/scheduler: Fix hang when sched_entity released

2021-05-20T03:50:28+00:00

Problem: If scheduler is already stopped by the time sched_entity
is released and entity's job_queue not empty I encountred
a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
never becomes false.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
Also wakeup all those processes stuck in sched_entity flushing
as the scheduler main thread which wakes them up is stopped by now.

v2:
Reverse order of drm_sched_rq_remove_entity and marking
s_entity as stopped to prevent reinserion back to rq due
to race.

v3:
Drop drm_sched_rq_remove_entity, only modify entity->stopped
and check for it in drm_sched_entity_is_idle

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Christian König 
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-14-andrey.grodzovsky@amd.com

drm/sched: Make timeout timer rearm conditional.

2021-05-20T03:50:28+00:00

We don't want to rearm the timer if driver hook reports
that the device is gone.

v5: Update drm_gpu_sched_stat values in code.

Signed-off-by: Andrey Grodzovsky 
Reviewed-by: Christian König 
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-11-andrey.grodzovsky@amd.com