| author | Yunxiang Li <Yunxiang.Li@amd.com> | 2026-06-01 15:15:06 -0400 |
|---|---|---|
| committer | Alex Deucher <alexander.deucher@amd.com> | 2026-06-03 13:58:57 -0400 |
| commit | baa286df5fb366e4aee5b747443dfc7194fdaa59 (patch) | |
| tree | 9197b254385bfc75a89c0b6796cb3b2f591e6389 | |
| parent | 43355f62cd2ef5386c2693df537c232ea0f2ce6c (diff) | |
drm/amdgpu: set sub_block_index for mca ras sub-blocks
The mca ras sub-blocks (mp0, mp1, mpio) all share the
AMDGPU_RAS_BLOCK__MCA block id and are distinguished only by
sub_block_index. The ras manager object for an mca block is selected
with:

    con->objs[AMDGPU_RAS_BLOCK__LAST + head->sub_block_index]

Since the rework in commit 7f544c5488cf ("drm/amdgpu: Rework mca ras
sw_init") moved the ras_comm setup into amdgpu_mca_mp*_ras_sw_init() but
left sub_block_index unset, mp0/mp1/mpio all default to index 0 and
collide on the same object slot. mp0 grabs the slot and creates its
sysfs node first; mp1 (and mpio) then find the slot already in use, so
amdgpu_ras_block_late_init() -> amdgpu_ras_sysfs_create() returns
-EINVAL:

    amdgpu: mca.mp1 failed to execute ras_block_late_init_default! ret:-22
    amdgpu: amdgpu_ras_late_init failed -22
    amdgpu: amdgpu_device_ip_late_init failed
    amdgpu: Fatal error during GPU init

The error is currently masked because amdgpu_ras_late_init() does not
check the return value of amdgpu_ras_block_late_init_default(). Even so,
the collision already leaves mp1 and mpio without their sysfs nodes, and
it becomes a fatal init failure as soon as that return value is honored.
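
The masking described above can be illustrated with a minimal, self-contained sketch. It is not the driver code: late_init_unchecked(), late_init_checked(), the two hook functions and SKETCH_EINVAL are hypothetical names chosen for illustration, and the real amdgpu_ras_late_init() may be structured differently.

```c
#include <assert.h>

/* Stand-in for EINVAL (22 on Linux); defined locally to keep the sketch portable. */
#define SKETCH_EINVAL 22

/* Hypothetical per-block late-init hooks. */
typedef int (*late_init_fn)(void);

static int block_late_init_ok(void)    { return 0; }
static int block_late_init_fails(void) { return -SKETCH_EINVAL; }

/* Calls every hook but drops the results, so a failing block goes unnoticed. */
static int late_init_unchecked(const late_init_fn *blocks, int n)
{
	assert(n >= 0);
	for (int i = 0; i < n; i++)
		(void)blocks[i]();
	return 0;
}

/* Honors the return value: the first failing block aborts the whole init. */
static int late_init_checked(const late_init_fn *blocks, int n)
{
	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		int r = blocks[i]();

		if (r)
			return r;
	}
	return 0;
}
```

With one succeeding hook and one failing hook, the unchecked variant reports success while the checked variant propagates the negative error code, which is the difference between a silently missing sysfs node and a fatal init failure.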
Restore the per-sub-block sub_block_index assignment so each mca
sub-block maps to its own object slot.
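
The slot collision and its fix can be modeled with a small standalone sketch. The enum values, struct layouts, array size and helper names below are assumptions made for illustration; only the indexing expression objs[LAST + sub_block_index] is taken from the description above.

```c
#include <assert.h>

/* Stand-in for EINVAL (22 on Linux). */
#define SKETCH_EINVAL 22

/* Hypothetical stand-ins for the driver enums; the numeric values are illustrative only. */
enum { RAS_BLOCK_MCA = 5, RAS_BLOCK_LAST = 8 };
enum { MCA_MP0 = 0, MCA_MP1, MCA_MPIO, MCA_SUB_LAST };

struct ras_head {
	int block;
	int sub_block_index;
};

struct ras_obj {
	int in_use;
};

/* Assumed layout: the mca sub-block slots follow the regular block slots. */
static struct ras_obj objs[RAS_BLOCK_LAST + MCA_SUB_LAST];

/* Slot selection for an mca head, indexed by sub_block_index. */
static struct ras_obj *find_slot(const struct ras_head *head)
{
	assert(head->block == RAS_BLOCK_MCA);
	assert(head->sub_block_index >= 0 && head->sub_block_index < MCA_SUB_LAST);
	return &objs[RAS_BLOCK_LAST + head->sub_block_index];
}

/* Claims the slot; fails if another sub-block already owns it. */
static int claim_slot(const struct ras_head *head)
{
	struct ras_obj *obj = find_slot(head);

	if (obj->in_use)
		return -SKETCH_EINVAL;
	obj->in_use = 1;
	return 0;
}

/*
 * Registers mp0, mp1 and mpio in order and returns how many got a slot.
 * assign_index == 0 models the bug (every index left at 0);
 * assign_index != 0 models the fix (one distinct index per sub-block).
 */
static int count_claimed(int assign_index)
{
	int claimed = 0;

	for (int i = 0; i < RAS_BLOCK_LAST + MCA_SUB_LAST; i++)
		objs[i].in_use = 0;

	for (int i = 0; i < MCA_SUB_LAST; i++) {
		struct ras_head head = { RAS_BLOCK_MCA, assign_index ? i : 0 };

		if (claim_slot(&head) == 0)
			claimed++;
	}
	return claimed;
}
```

In this model, leaving every index at 0 lets only the first sub-block claim a slot, while assigning distinct indices lets all three succeed.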
Fixes: 7f544c5488cf ("drm/amdgpu: Rework mca ras sw_init")
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
| -rw-r--r-- | drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c | 3 |
1 file changed, 3 insertions, 0 deletions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
index 823ba17e32af..cc6d1a4e4c3a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mca.c
@@ -99,6 +99,7 @@ int amdgpu_mca_mp0_ras_sw_init(struct amdgpu_device *adev)
 
 	strcpy(ras->ras_block.ras_comm.name, "mca.mp0");
 	ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__MCA;
+	ras->ras_block.ras_comm.sub_block_index = AMDGPU_RAS_MCA_BLOCK__MP0;
 	ras->ras_block.ras_comm.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
 	adev->mca.mp0.ras_if = &ras->ras_block.ras_comm;
 
@@ -123,6 +124,7 @@ int amdgpu_mca_mp1_ras_sw_init(struct amdgpu_device *adev)
 
 	strcpy(ras->ras_block.ras_comm.name, "mca.mp1");
 	ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__MCA;
+	ras->ras_block.ras_comm.sub_block_index = AMDGPU_RAS_MCA_BLOCK__MP1;
 	ras->ras_block.ras_comm.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
 	adev->mca.mp1.ras_if = &ras->ras_block.ras_comm;
 
@@ -147,6 +149,7 @@ int amdgpu_mca_mpio_ras_sw_init(struct amdgpu_device *adev)
 
 	strcpy(ras->ras_block.ras_comm.name, "mca.mpio");
 	ras->ras_block.ras_comm.block = AMDGPU_RAS_BLOCK__MCA;
+	ras->ras_block.ras_comm.sub_block_index = AMDGPU_RAS_MCA_BLOCK__MPIO;
 	ras->ras_block.ras_comm.type = AMDGPU_RAS_ERROR__MULTI_UNCORRECTABLE;
 	adev->mca.mpio.ras_if = &ras->ras_block.ras_comm;
 
