linux-stable.git/drivers/gpu/drm/xe, branch linux-6.10.y

drm/xe: fix UAF around queue destruction

2024-10-10T10:01:10+00:00

[ Upstream commit 2d2be279f1ca9e7288282d4214f16eea8a727cdb ]

We currently queue the final destruction step on a system wq, which
outlives the driver instance. With bad timing we can tear down the
driver while one or more work items are still pending, leading to
various UAF splats. Add a fini step to ensure user queues are properly
torn down. At this point GuC should already be nuked, so the queue
itself should no longer be referenced from the hw pov.

v2 (Matt B)
 - Looks much safer to use a waitqueue and then just wait for the
   xa_array to become empty before triggering the drain.
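
The "wait for the array to become empty before draining" idea can be
illustrated outside the kernel. The following is a minimal userspace
sketch, not the driver code: struct queue_registry and all function
names are hypothetical, and a pthread condition variable stands in for
the kernel waitqueue.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the driver's table of live user queues. */
struct queue_registry {
	pthread_mutex_t lock;
	pthread_cond_t  empty;  /* signalled when 'live' drops to zero */
	unsigned int    live;   /* number of queues still registered */
};

static void registry_init(struct queue_registry *r)
{
	pthread_mutex_init(&r->lock, NULL);
	pthread_cond_init(&r->empty, NULL);
	r->live = 0;
}

static void registry_add(struct queue_registry *r)
{
	pthread_mutex_lock(&r->lock);
	r->live++;
	pthread_mutex_unlock(&r->lock);
}

/* Called from the final destruction step of each queue. */
static void registry_remove(struct queue_registry *r)
{
	pthread_mutex_lock(&r->lock);
	assert(r->live > 0);
	if (--r->live == 0)
		pthread_cond_broadcast(&r->empty);
	pthread_mutex_unlock(&r->lock);
}

/* Fini step: block until every queue has gone through registry_remove(). */
static void registry_wait_empty(struct queue_registry *r)
{
	pthread_mutex_lock(&r->lock);
	while (r->live != 0)
		pthread_cond_wait(&r->empty, &r->lock);
	pthread_mutex_unlock(&r->lock);
}
```

Only once registry_wait_empty() has returned is it safe to drain and
free whatever the destruction work was using, since no further work can
be queued at that point.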

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2317
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Auld 
Cc: Matthew Brost 
Cc:  # v6.8+
Reviewed-by: Matthew Brost 
Link: https://patchwork.freedesktop.org/patch/msgid/20240923145647.77707-2-matthew.auld@intel.com
(cherry picked from commit 861108666cc0e999cffeab6aff17b662e68774e3)
Signed-off-by: Lucas De Marchi 
Signed-off-by: Sasha Levin

drm/xe: Delete unused GuC submission_state.suspend

2024-10-10T10:01:10+00:00

[ Upstream commit 3f371a98deada9aee53d908c9aa53f6cdcb1300b ]

GuC submission_state.suspend is unused, delete it.

Signed-off-by: Matthew Brost 
Reviewed-by: Himal Prasad Ghimiray 
Link: https://patchwork.freedesktop.org/patch/msgid/20240425054747.1918811-1-matthew.brost@intel.com
Stable-dep-of: 2d2be279f1ca ("drm/xe: fix UAF around queue destruction")
Signed-off-by: Sasha Levin

drm/xe: Fix memory leak on xe_alloc_pf_queue failure

2024-10-10T10:00:43+00:00

[ Upstream commit c5f728de696caa35481fd84202dfbc9fecc18e0b ]

Simplify memory unwinding on error, which also fixes a memory leak
that can currently happen on the error path.

v2: use devm_kcalloc (Matt A)
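
Why a device-managed allocation simplifies unwinding can be shown with
a small userspace analogue. This is only a sketch under stated
assumptions: struct fake_device, managed_calloc() and device_release()
are hypothetical names, not kernel APIs. They mimic the idea that
buffers are owned by the device and released in one place.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bookkeeping node for one device-owned allocation. */
struct managed_node {
	struct managed_node *next;
	void *mem;
};

/* Hypothetical stand-in for a device that owns its allocations. */
struct fake_device {
	struct managed_node *allocs;
	size_t live;
};

/* Zeroed array allocation owned by dev; calloc() rejects n * size overflow. */
static void *managed_calloc(struct fake_device *dev, size_t n, size_t size)
{
	struct managed_node *node = malloc(sizeof(*node));

	if (!node)
		return NULL;
	node->mem = calloc(n, size);
	if (!node->mem) {
		free(node);
		return NULL;
	}
	node->next = dev->allocs;
	dev->allocs = node;
	dev->live++;
	return node->mem;
}

/* Single release point: callers need no per-buffer unwinding on error. */
static void device_release(struct fake_device *dev)
{
	while (dev->allocs) {
		struct managed_node *node = dev->allocs;

		dev->allocs = node->next;
		free(node->mem);
		free(node);
		dev->live--;
	}
	assert(dev->live == 0);
}
```

If a later allocation fails, the caller simply returns the error; the
earlier buffers are still tracked by the device and are freed when it is
released, so nothing leaks.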

Fixes: 3338e4f90c14 ("drm/xe: Use topology to determine page fault queue size")
Cc: Matthew Auld 
Cc: Matthew Brost 
Cc: Rodrigo Vivi 
Cc: Stuart Summers 
Reviewed-by: Matthew Auld 
Link: https://patchwork.freedesktop.org/patch/msgid/20240826162035.20462-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das 
Signed-off-by: Sasha Levin

drm/xe: fixup xe_alloc_pf_queue

2024-10-10T10:00:43+00:00

[ Upstream commit 321d6b4b9cbe3dd0bc99937d5e5b4d730b5b5798 ]

kzalloc expects a number of bytes, so we should convert the number
of dw into bytes; otherwise we are likely accessing beyond the
array, causing all kinds of carnage. Also fix up the error handling
while we are here.

v2:
 - Prefer kcalloc (dim)
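
The dword-versus-byte mix-up is easy to reproduce in userspace. The
sketch below is not the driver code: struct pf_queue and both function
names are hypothetical, and calloc() stands in for kcalloc. It shows the
count-times-element-size form, which makes the unit conversion explicit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queue whose length is tracked in dwords (32-bit words). */
struct pf_queue {
	uint32_t *data;
	size_t num_dw;
};

/* Returns 0 on success, -1 on allocation failure. */
static int pf_queue_alloc(struct pf_queue *q, size_t num_dw)
{
	assert(num_dw > 0);

	/*
	 * Passing num_dw alone as a byte count would allocate only a
	 * quarter of the required space. calloc(count, size) multiplies
	 * for us and fails cleanly if the product overflows.
	 */
	q->data = calloc(num_dw, sizeof(*q->data));
	if (!q->data) {
		q->num_dw = 0;
		return -1;
	}
	q->num_dw = num_dw;
	return 0;
}

/* Size of the backing storage in bytes. */
static size_t pf_queue_bytes(const struct pf_queue *q)
{
	return q->num_dw * sizeof(*q->data);
}
```

With this shape, indexing q->data[num_dw - 1] stays inside the buffer,
which is exactly what the byte-sized allocation failed to guarantee.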

Fixes: 3338e4f90c14 ("drm/xe: Use topology to determine page fault queue size")
Signed-off-by: Matthew Auld 
Cc: Stuart Summers 
Cc: Matthew Brost 
Reviewed-by: Nirmoy Das 
Signed-off-by: Matthew Brost 
Link: https://patchwork.freedesktop.org/patch/msgid/20240821171917.417386-2-matthew.auld@intel.com
Signed-off-by: Sasha Levin

drm/xe: Drop warn on xe_guc_pc_gucrc_disable in guc pc fini

2024-10-10T10:00:38+00:00

[ Upstream commit a323782567812ee925e9b7926445532c7afe331b ]

Not a big deal if CT is down as driver is unloading, no need to warn.

Signed-off-by: Matthew Brost 
Reviewed-by: Jagmeet Randhawa 
Link: https://patchwork.freedesktop.org/patch/msgid/20240820172958.1095143-4-matthew.brost@intel.com
Signed-off-by: Sasha Levin

drm/xe: Use topology to determine page fault queue size

2024-10-10T10:00:37+00:00

[ Upstream commit 3338e4f90c143cf32f77d64f464cb7f2c2d24700 ]

Currently the page fault queue size is hard coded. However
the hardware supports faulting for each EU and each CS.
For some applications running on hardware with a large
number of EUs and CSs, this can result in an overflow of
the page fault queue.

Add a small calculation to determine the page fault queue
size based on the number of EUs and CSs in the platform as
determined by fuses.
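
A calculation of this kind might look like the sketch below. The exact
formula, the PF_MSG_LEN_DW value, and the function and parameter names
are assumptions made for illustration; the commit message only states
that the size is derived from the fused EU and CS counts.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed length of one page fault message, in dwords. */
#define PF_MSG_LEN_DW 4u

/* Count set bits in a fuse mask (Kernighan's method). */
static unsigned int popcount64(uint64_t mask)
{
	unsigned int n = 0;

	while (mask) {
		mask &= mask - 1;  /* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Hypothetical sizing rule: one message slot per EU and per command
 * streamer, so every potential fault source can have an entry queued.
 */
static unsigned int pf_queue_num_dw(uint64_t dss_mask,
				    uint64_t eu_mask_per_dss,
				    unsigned int num_cs)
{
	unsigned int num_eus;

	assert(num_cs > 0);  /* a GT without command streamers is not meaningful */
	num_eus = popcount64(dss_mask) * popcount64(eu_mask_per_dss);
	return (num_eus + num_cs) * PF_MSG_LEN_DW;
}
```

For example, 8 enabled DSS with 16 EUs each and 10 command streamers
give (128 + 10) * 4 = 552 dwords under these assumptions.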

Signed-off-by: Stuart Summers 
Reviewed-by: Matthew Brost 
Signed-off-by: Matthew Brost 
Link: https://patchwork.freedesktop.org/patch/msgid/24d582a3b48c97793b8b6a402f34b4b469471636.1723862633.git.stuart.summers@intel.com
Signed-off-by: Sasha Levin

drm/xe/hdcp: Check GSC structure validity

2024-10-10T10:00:29+00:00

[ Upstream commit b4224f6bae3801d589f815672ec62800a1501b0d ]

Sometimes xe_gsc is not yet initialized when the HDCP capability
check runs. Add a check on the gsc structure to avoid a NULL pointer
dereference.

Signed-off-by: Suraj Kandpal 
Reviewed-by: Dnyaneshwar Bhadane 
Link: https://patchwork.freedesktop.org/patch/msgid/20240722064451.3610512-4-suraj.kandpal@intel.com
Signed-off-by: Sasha Levin

drm/xe: Prevent null pointer access in xe_migrate_copy

2024-10-10T10:00:12+00:00

[ Upstream commit 7257d9c9a3c6cfe26c428e9b7ae21d61f2f55a79 ]

xe_migrate_copy is designed to copy the content of TTM resources. When
the source resource is NULL, it triggers a NULL pointer dereference in
xe_migrate_copy. To avoid this, set the lacks source flag to true in
this case; the flag then triggers xe_migrate_clear rather than
xe_migrate_copy.
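
The decision described above amounts to choosing the operation before
the source is ever dereferenced. A minimal userspace sketch follows;
struct resource, enum move_op and choose_move_op() are hypothetical
names used only for illustration, not the driver's types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a TTM resource. */
struct resource {
	int mem_type;
	size_t size;
};

enum move_op {
	MOVE_CLEAR,  /* destination is filled without reading a source */
	MOVE_COPY,   /* destination is filled from the source contents */
};

/* Pick the operation without ever dereferencing a NULL source. */
static enum move_op choose_move_op(const struct resource *src,
				   const struct resource *dst)
{
	bool lacks_source;

	assert(dst != NULL);  /* a move always has a destination */

	/* A missing source means there is nothing to copy from. */
	lacks_source = (src == NULL);

	return lacks_source ? MOVE_CLEAR : MOVE_COPY;
}
```

The copy path can then assume a valid source, because every caller that
reaches it has already passed this check.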

Issue trace:
<7> [317.089847] xe 0000:00:02.0: [drm:xe_migrate_copy [xe]] Pass 14,
 sizes: 4194304 & 4194304
<7> [317.089945] xe 0000:00:02.0: [drm:xe_migrate_copy [xe]] Pass 15,
 sizes: 4194304 & 4194304
<1> [317.128055] BUG: kernel NULL pointer dereference, address:
 0000000000000010
<1> [317.128064] #PF: supervisor read access in kernel mode
<1> [317.128066] #PF: error_code(0x0000) - not-present page
<6> [317.128069] PGD 0 P4D 0
<4> [317.128071] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
<4> [317.128074] CPU: 1 UID: 0 PID: 1440 Comm: kunit_try_catch Tainted:
 G     U           N 6.11.0-rc7-xe #1
<4> [317.128078] Tainted: [U]=USER, [N]=TEST
<4> [317.128080] Hardware name: Intel Corporation Lunar Lake Client
 Platform/LNL-M LP5 RVP1, BIOS LNLMFWI1.R00.3221.D80.2407291239 07/29/2024
<4> [317.128082] RIP: 0010:xe_migrate_copy+0x66/0x13e0 [xe]
<4> [317.128158] Code: 00 00 48 89 8d e0 fe ff ff 48 8b 40 10 4c 89 85 c8
 fe ff ff 44 88 8d bd fe ff ff 65 48 8b 3c 25 28 00 00 00 48 89 7d d0 31
 ff <8b> 79 10 48 89 85 a0 fe ff ff 48 8b 00 48 89 b5 d8 fe ff ff 83 ff
<4> [317.128162] RSP: 0018:ffffc9000167f9f0 EFLAGS: 00010246
<4> [317.128164] RAX: ffff8881120d8028 RBX: ffff88814d070428 RCX:
 0000000000000000
<4> [317.128166] RDX: ffff88813cb99c00 RSI: 0000000004000000 RDI:
 0000000000000000
<4> [317.128168] RBP: ffffc9000167fbb8 R08: ffff88814e7b1f08 R09:
 0000000000000001
<4> [317.128170] R10: 0000000000000001 R11: 0000000000000001 R12:
 ffff88814e7b1f08
<4> [317.128172] R13: ffff88814e7b1f08 R14: ffff88813cb99c00 R15:
 0000000000000001
<4> [317.128174] FS:  0000000000000000(0000) GS:ffff88846f280000(0000)
 knlGS:0000000000000000
<4> [317.128176] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [317.128178] CR2: 0000000000000010 CR3: 000000011f676004 CR4:
 0000000000770ef0
<4> [317.128180] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
 0000000000000000
<4> [317.128182] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7:
 0000000000000400
<4> [317.128184] PKRU: 55555554
<4> [317.128185] Call Trace:
<4> [317.128187]  <TASK>
<4> [317.128189]  ? show_regs+0x67/0x70
<4> [317.128194]  ? __die_body+0x20/0x70
<4> [317.128196]  ? __die+0x2b/0x40
<4> [317.128198]  ? page_fault_oops+0x15f/0x4e0
<4> [317.128203]  ? do_user_addr_fault+0x3fb/0x970
<4> [317.128205]  ? lock_acquire+0xc7/0x2e0
<4> [317.128209]  ? exc_page_fault+0x87/0x2b0
<4> [317.128212]  ? asm_exc_page_fault+0x27/0x30
<4> [317.128216]  ? xe_migrate_copy+0x66/0x13e0 [xe]
<4> [317.128263]  ? __lock_acquire+0xb9d/0x26f0
<4> [317.128265]  ? __lock_acquire+0xb9d/0x26f0
<4> [317.128267]  ? sg_free_append_table+0x20/0x80
<4> [317.128271]  ? lock_acquire+0xc7/0x2e0
<4> [317.128273]  ? mark_held_locks+0x4d/0x80
<4> [317.128275]  ? trace_hardirqs_on+0x1e/0xd0
<4> [317.128278]  ? _raw_spin_unlock_irqrestore+0x31/0x60
<4> [317.128281]  ? __pm_runtime_resume+0x60/0xa0
<4> [317.128284]  xe_bo_move+0x682/0xc50 [xe]
<4> [317.128315]  ? lock_is_held_type+0xaa/0x120
<4> [317.128318]  ttm_bo_handle_move_mem+0xe5/0x1a0 [ttm]
<4> [317.128324]  ttm_bo_validate+0xd1/0x1a0 [ttm]
<4> [317.128328]  shrink_test_run_device+0x721/0xc10 [xe]
<4> [317.128360]  ? find_held_lock+0x31/0x90
<4> [317.128363]  ? lock_release+0xd1/0x2a0
<4> [317.128365]  ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
 [kunit]
<4> [317.128370]  xe_bo_shrink_kunit+0x11/0x20 [xe]
<4> [317.128397]  kunit_try_run_case+0x6e/0x150 [kunit]
<4> [317.128400]  ? trace_hardirqs_on+0x1e/0xd0
<4> [317.128402]  ? _raw_spin_unlock_irqrestore+0x31/0x60
<4> [317.128404]  kunit_generic_run_threadfn_adapter+0x1e/0x40 [kunit]
<4> [317.128407]  kthread+0xf5/0x130
<4> [317.128410]  ? __pfx_kthread+0x10/0x10
<4> [317.128412]  ret_from_fork+0x39/0x60
<4> [317.128415]  ? __pfx_kthread+0x10/0x10
<4> [317.128416]  ret_from_fork_asm+0x1a/0x30
<4> [317.128420]  </TASK>

Fixes: 266c85885263 ("drm/xe/xe2: Handle flat ccs move for igfx.")
Signed-off-by: Zhanjun Dong 
Reviewed-by: Thomas Hellström 
Signed-off-by: Matt Roper 
Link: https://patchwork.freedesktop.org/patch/msgid/20240927161308.862323-2-zhanjun.dong@intel.com
(cherry picked from commit 59a1c9c7e1d02b43b415ea92627ce095b7c79e47)
Signed-off-by: Lucas De Marchi 
Signed-off-by: Sasha Levin

drm/xe: Resume TDR after GT reset

2024-10-10T10:00:11+00:00

[ Upstream commit 1b30f87e088b499eb74298db256da5c98e8276e2 ]

Not restarting the TDR after a GT reset on exec queues which have been
restarted can allow jobs to run forever. Fix this by restarting the
TDR.

Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Signed-off-by: Matthew Brost 
Reviewed-by: Nirmoy Das 
Link: https://patchwork.freedesktop.org/patch/msgid/20240724235919.1917216-1-matthew.brost@intel.com
(cherry picked from commit 8ec5a4e5ce97d6ee9f5eb5b4ce4cfc831976fdec)
Signed-off-by: Lucas De Marchi 
Signed-off-by: Sasha Levin

drm/xe: Restore pci state upon resume

2024-10-10T10:00:11+00:00

[ Upstream commit cffa8e83df9fe525afad1e1099097413f9174f57 ]

The pci state was saved, but not restored. Restore it
right after the power state transition request, like
every other driver.

v2: Use the right Fixes tag, since this was there initially but
    was accidentally removed.

Fixes: f6761c68c0ac ("drm/xe/display: Improve s2idle handling.")
Cc: Maarten Lankhorst 
Cc: Lucas De Marchi 
Reviewed-by: Jonathan Cavitt 
Signed-off-by: Rodrigo Vivi 
Link: https://patchwork.freedesktop.org/patch/msgid/20240912214507.456897-1-rodrigo.vivi@intel.com
Signed-off-by: Maarten Lankhorst 
(cherry picked from commit ec2d1539e159f53eae708e194c449cfefa004994)
Signed-off-by: Lucas De Marchi 
Signed-off-by: Sasha Levin