linux-stable.git/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c, branch v6.2

drm/amdkfd: Fix NULL pointer error for GC 11.0.1 on mGPU

2023-01-10T23:11:08+00:00

The point bo->kfd_bo is NULL for queue's write pointer BO
when creating queue on mGPU. To avoid using the pointer
fixes the error.

Signed-off-by: Eric Huang 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Fix double release compute pasid

2022-12-20T17:58:06+00:00

If kfd_process_device_init_vm returns failure after vm is converted to
compute vm and vm->pasid set to compute pasid, KFD will not take
pdd->drm_file reference. As a result, drm close file handler maybe
called to release the compute pasid before KFD process destroy worker to
release the same pasid and set vm->pasid to zero, this generates below
WARNING backtrace and NULL pointer access.

Add helper amdgpu_amdkfd_gpuvm_set_vm_pasid and call it at the last step
of kfd_process_device_init_vm, to ensure vm pasid is the original pasid
if acquiring vm failed or is the compute pasid with pdd->drm_file
reference taken to avoid double release same pasid.

 amdgpu: Failed to create process VM object
 ida_free called for id=32770 which is not allocated.
 WARNING: CPU: 57 PID: 72542 at ../lib/idr.c:522 ida_free+0x96/0x140
 RIP: 0010:ida_free+0x96/0x140
 Call Trace:
  amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
  amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
  drm_file_free.part.13+0x216/0x270 [drm]
  drm_close_helper.isra.14+0x60/0x70 [drm]
  drm_release+0x6e/0xf0 [drm]
  __fput+0xcc/0x280
  ____fput+0xe/0x20
  task_work_run+0x96/0xc0
  do_exit+0x3d0/0xc10

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 RIP: 0010:ida_free+0x76/0x140
 Call Trace:
  amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
  amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
  drm_file_free.part.13+0x216/0x270 [drm]
  drm_close_helper.isra.14+0x60/0x70 [drm]
  drm_release+0x6e/0xf0 [drm]
  __fput+0xcc/0x280
  ____fput+0xe/0x20
  task_work_run+0x96/0xc0
  do_exit+0x3d0/0xc10

Signed-off-by: Philip Yang 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdgpu: Add notifier lock for KFD userptrs

2022-12-14T14:48:05+00:00

Add a per-process MMU notifier lock for processing notifiers from
userptrs. Use that lock to properly synchronize page table updates with
MMU notifiers.

Signed-off-by: Felix Kuehling 
Reviewed-by: Xiaogang Chen
Signed-off-by: Alex Deucher

drm/amdkfd: Fix memory leakage

2022-11-29T16:03:39+00:00

This patch fixes potential memory leakage and seg fault
in  _gpuvm_import_dmabuf() function

Fixes: d4ec4bdc0bd5 ("drm/amdkfd: Allow access for mmapping KFD BOs")
Signed-off-by: Konstantin Meskhidze 
Signed-off-by: Felix Kuehling 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdgpu: rename the files for HMM handling

2022-11-17T05:23:36+00:00

Clean that up a bit, no functional change.

Signed-off-by: Christian König 
Reviewed-by: Alex Deucher 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdgpu: fix userptr HMM range handling v2

2022-11-17T05:23:30+00:00

The basic problem here is that it's not allowed to page fault while
holding the reservation lock.

So it can happen that multiple processes try to validate an userptr
at the same time.

Work around that by putting the HMM range object into the mutex
protected bo list for now.

v2: make sure range is set to NULL in case of an error

Signed-off-by: Christian König 
Reviewed-by: Alex Deucher 
Reviewed-by: Felix Kuehling 
CC: stable@vger.kernel.org
Signed-off-by: Alex Deucher

drm/amdkfd: Fix a memory limit issue

2022-11-15T18:35:16+00:00

It is to resolve a regression, which fails to allocate
VRAM due to no free memory in application, the reason
is we add check of vram_pin_size for memory limit, and
application is pinning the memory for Peerdirect, KFD
should not count it in memory limit. So removing
vram_pin_size will resolve it.

Signed-off-by: Eric Huang 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdgpu: Set MTYPE in PTE based on BO flags

2022-11-09T22:41:42+00:00

The same BO may need different MTYPEs and SNOOP flags in PTEs depending
on its current location relative to the mapping GPU. Setting MTYPEs from
clients ahead of time is not practical for coherent memory sharing.
Instead determine the correct MTYPE for the desired coherence model and
current BO location when updating the page tables.

To maintain backwards compatibility with MTYPE-selection in
AMDGPU_VA_OP_MAP, the coherence-model-based MTYPE selection is only
applied if it chooses an MTYPE other than MTYPE_NC (the default).

Add two AMDGPU_GEM_CREATE_... flags to indicate the coherence model. The
default if no flag is specified is non-coherent (i.e. coarse-grained
coherent at dispatch boundaries).

Update amdgpu_amdkfd_gpuvm.c to use this new method to choose the
correct MTYPE depending on the current memory location.

v2:
* check that bo is not NULL (e.g. PRT mappings)
* Fix missing ~ bitmask in gmc_v11_0.c
v3:
* squash in "drm/amdgpu: Inherit coherence flags on dmabuf import"

Suggested-by: Christian König 
Signed-off-by: Felix Kuehling 
Acked-by: Christian König 
Signed-off-by: Alex Deucher

drm/amdkfd: Remove skiping userptr buffer mapping when mmu notifier marks it as invalid

2022-11-04T20:05:54+00:00

mmu notifier does not always hold mm->sem during call back. That causes
a race condition between kfd userprt buffer mapping and mmu notifier
which leds to gpu shadder or SDMA access userptr buffer before it has been
mapped to gpu VM. Always map userptr buffer to avoid that though it may make
some userprt buffers mapped two times.

Suggested-by: Felix Kuehling 
Signed-off-by: Xiaogang Chen 
Reviewed-by: Felix Kuehling 
Signed-off-by: Alex Deucher

drm/amdkfd: Fix memory leak in kfd_mem_dmamap_userptr()

2022-10-24T18:34:27+00:00

If the number of pages from the userptr BO differs from the SG BO then the
allocated memory for the SG table doesn't get freed before returning
-EINVAL, which may lead to a memory leak in some error paths. Fix this by
checking the number of pages before allocating memory for the SG table.

Fixes: 264fb4d332f5 ("drm/amdgpu: Add multi-GPU DMA mapping helpers")
Signed-off-by: Rafael Mendonca 
Reviewed-by: Felix Kuehling 
Signed-off-by: Felix Kuehling 
Signed-off-by: Alex Deucher