linux-stable.git/drivers/edac, branch linux-6.15.y

EDAC/synopsys: Clear the ECC counters on init

2025-08-20T16:35:55+00:00

[ Upstream commit b1dc7f097b78eb8d25b071ead2384b07a549692b ]

Clear the ECC error and counter registers during initialization/probe to avoid
reporting stale errors that may have occurred before EDAC registration.

For that, unify the Zynq and ZynqMP ECC state reading paths and simplify the
code.

  [ bp: Massage commit message.
    Fix an -Wsometimes-uninitialized warning as reported by
    Reported-by: kernel test robot 
    Closes: https://lore.kernel.org/oe-kbuild-all/202507141048.obUv3ZUm-lkp@intel.com ]

Signed-off-by: Shubhrajyoti Datta 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/20250713050753.7042-1-shubhrajyoti.datta@amd.com
Signed-off-by: Sasha Levin

EDAC: Initialize EDAC features sysfs attributes

2025-07-17T16:43:38+00:00

[ Upstream commit 1e14ea901dc8d976d355ddc3e0de84ee86ef0596 ]

Fix the lockdep splat caused by missing sysfs_attr_init() calls for the
recently added EDAC feature's sysfs attributes.

In lockdep_init_map_type(), the check for the lock-class key if
(!static_obj(key) && !is_dynamic_key(key)) causes the splat.

  Backtrace:
  RIP: 0010:lockdep_init_map_type
  Call Trace:
   __kernfs_create_file
  sysfs_add_file_mode_ns
  internal_create_group
  internal_create_groups
  device_add
  ? __init_waitqueue_head
  edac_dev_register
  devm_cxl_memdev_edac_register
  ? lock_acquire
  ? find_held_lock
  ? cxl_mem_probe
  ? cxl_mem_probe
  ? lockdep_hardirqs_on
  ? cxl_mem_probe
  cxl_mem_probe

  [ bp: Massage. ]

Fixes: f90b738166fe ("EDAC: Add scrub control feature")
Fixes: bcbd069b11b0 ("EDAC: Add a Error Check Scrub control feature")
Fixes: 699ea5219c4b ("EDAC: Add a memory repair control feature")
Reported-by: Dave Jiang 
Suggested-by: Jonathan Cameron 
Signed-off-by: Shiju Jose 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Jonathan Cameron 
Link: https://lore.kernel.org/20250626101344.1726-1-shiju.jose@huawei.com
Signed-off-by: Sasha Levin

EDAC/amd64: Fix size calculation for Non-Power-of-Two DIMMs

2025-07-06T09:04:17+00:00

commit a3f3040657417aeadb9622c629d4a0c2693a0f93 upstream.

Each Chip-Select (CS) of a Unified Memory Controller (UMC) on AMD Zen-based
SOCs has an Address Mask and a Secondary Address Mask register associated with
it. The amd64_edac module logs DIMM sizes on a per-UMC per-CS granularity
during init using these two registers.

Currently, the module primarily considers only the Address Mask register for
computing DIMM sizes. The Secondary Address Mask register is only considered
for odd CS. Additionally, if it has been considered, the Address Mask register
is ignored altogether for that CS. For power-of-two DIMMs i.e. DIMMs whose
total capacity is a power of two (32GB, 64GB, etc), this is not an issue
since only the Address Mask register is used.

For non-power-of-two DIMMs i.e., DIMMs whose total capacity is not a power of
two (48GB, 96GB, etc), however, the Secondary Address Mask register is used
in conjunction with the Address Mask register. However, since the module only
considers either of the two registers for a CS, the size computed by the
module is incorrect. The Secondary Address Mask register is not considered for
even CS, and the Address Mask register is not considered for odd CS.

Introduce a new helper function so that both Address Mask and Secondary
Address Mask registers are considered, when valid, for computing DIMM sizes.
Furthermore, also rename some variables for greater clarity.

Fixes: 81f5090db843 ("EDAC/amd64: Support asymmetric dual-rank DIMMs")
Closes: https://lore.kernel.org/dbec22b6-00f2-498b-b70d-ab6f8a5ec87e@natrix.lt
Reported-by: Žilvinas Žaltiena 
Signed-off-by: Avadhut Naik 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Yazen Ghannam 
Tested-by: Žilvinas Žaltiena 
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250529205013.403450-1-avadhut.naik@amd.com
Signed-off-by: Greg Kroah-Hartman

EDAC/igen6: Fix NULL pointer dereference

2025-06-27T10:13:41+00:00

commit 88efa0de3285be66969b71ec137d9dab1ee19e52 upstream.

A kernel panic was reported with the following kernel log:

  EDAC igen6: Expected 2 mcs, but only 1 detected.
  BUG: unable to handle page fault for address: 000000000000d570
  ...
  Hardware name: Notebook V54x_6x_TU/V54x_6x_TU, BIOS Dasharo (coreboot+UEFI) v0.9.0 07/17/2024
  RIP: e030:ecclog_handler+0x7e/0xf0 [igen6_edac]
  ...
  igen6_probe+0x2a0/0x343 [igen6_edac]
  ...
  igen6_init+0xc5/0xff0 [igen6_edac]
  ...

This issue occurred because one memory controller was disabled by
the BIOS but the igen6_edac driver still checked all the memory
controllers, including this absent one, to identify the source of
the error. Accessing the null MMIO for the absent memory controller
resulted in the oops above.

Fix this issue by reverting the configuration structure to non-const
and updating the field 'res_cfg->num_imc' to reflect the number of
detected memory controllers.

Fixes: 20e190b1c1fd ("EDAC/igen6: Skip absent memory controllers")
Reported-by: Marek Marczykowski-Górecki 
Closes: https://lore.kernel.org/all/aFFN7RlXkaK_loQb@mail-itl/
Suggested-by: Borislav Petkov 
Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov (AMD) 
Tested-by: Marek Marczykowski-Górecki 
Link: https://lore.kernel.org/r/20250618162307.1523736-1-qiuxu.zhuo@intel.com
Signed-off-by: Greg Kroah-Hartman

EDAC/amd64: Correct number of UMCs for family 19h models 70h-7fh

2025-06-27T10:13:40+00:00

commit b2e673ae53ef4b943f68585207a5f21cfc9a0714 upstream.

AMD's Family 19h-based Models 70h-7fh support 4 unified memory controllers
(UMC) per processor die.

The amd64_edac driver, however, assumes only 2 UMCs are supported since
max_mcs variable for the models has not been explicitly set to 4. The same
results in incomplete or incorrect memory information being logged to dmesg by
the module during initialization in some instances.

Fixes: 6c79e42169fe ("EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh")
Closes: https://lore.kernel.org/all/27dc093f-ce27-4c71-9e81-786150a040b6@reox.at/
Reported-by: reox 
Signed-off-by: Avadhut Naik 
Signed-off-by: Borislav Petkov (AMD) 
Cc: stable@kernel.org
Link: https://lore.kernel.org/20250613005233.2330627-1-avadhut.naik@amd.com
Signed-off-by: Greg Kroah-Hartman

EDAC/igen6: Skip absent memory controllers

2025-06-27T10:13:11+00:00

[ Upstream commit 20e190b1c1fd88b21cc5106c12cfe6def5ab849d ]

Some BIOS versions may fuse off certain memory controllers and set the
registers of these absent memory controllers to ~0. The current igen6_edac
mistakenly enumerates these absent memory controllers and registers them
with the EDAC core.

Skip the absent memory controllers to avoid mistakenly enumerating them.

Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Tony Luck 
Link: https://lore.kernel.org/r/20250408132455.489046-2-qiuxu.zhuo@intel.com
Signed-off-by: Sasha Levin

EDAC/altera: Use correct write width with the INTTEST register

2025-06-27T10:13:05+00:00

commit e5ef4cd2a47f27c0c9d8ff6c0f63a18937c071a3 upstream.

On the SoCFPGA platform, the INTTEST register supports only 16-bit writes.
A 32-bit write triggers an SError to the CPU so do 16-bit accesses only.

  [ bp: AI-massage the commit message. ]

Fixes: c7b4be8db8bc ("EDAC, altera: Add Arria10 OCRAM ECC support")
Signed-off-by: Niravkumar L Rabara 
Signed-off-by: Matthew Gerlach 
Signed-off-by: Borislav Petkov (AMD) 
Acked-by: Dinh Nguyen 
Cc: stable@kernel.org
Link: https://lore.kernel.org/20250527145707.25458-1-matthew.gerlach@altera.com
Signed-off-by: Greg Kroah-Hartman

EDAC/bluefield: Don't use bluefield_edac_readl() result on error

2025-06-19T13:39:27+00:00

[ Upstream commit ea3b0b7f541b9511abe2b89547c95458804f38e2 ]

The bluefield_edac_readl() routine returns an uninitialized result on error
paths. In those cases the calling routine should not use the uninitialized
result. The driver should simply log the error, and then return early.

Fixes: e41967575474 ("EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2")
Signed-off-by: David Thompson 
Signed-off-by: Borislav Petkov (AMD) 
Reviewed-by: Shravan Kumar Ramani 
Link: https://lore.kernel.org/20250318214747.12271-1-davthompson@nvidia.com
Signed-off-by: Sasha Levin

EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0

2025-06-19T13:39:23+00:00

[ Upstream commit eeed3e03f4261e5e381a72ae099ff00ccafbb437 ]

When enabling the retry_rd_err_log (RRL) feature during the loading of the
i10nm_edac driver with the module parameter retry_rd_err_log=2 (Linux RRL
control mode), the default values of the control bits of RRL are saved so
that they can be restored during the unloading of the driver.

In the current code, the RRL of pseudo channel 1 of HBM overwrites pseudo
channel 0 during the loading of the driver, resulting in the loss of saved
RRL for pseudo channel 0. This causes the RRL of pseudo channel 0 of HBM to
be wrongly restored with the values from pseudo channel 1 when unloading
the driver.

Fix this issue by creating two separate groups of RRL control registers
per channel to save default RRL settings of two {sub-,pseudo-}channels.

Fixes: acd4cf68fefe ("EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM")
Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Tony Luck 
Tested-by: Feng Xu 
Link: https://lore.kernel.org/r/20250417150724.1170168-3-qiuxu.zhuo@intel.com
Signed-off-by: Sasha Levin

EDAC/skx_common: Fix general protection fault

2025-06-19T13:39:23+00:00

[ Upstream commit 20d2d476b3ae18041be423671a8637ed5ffd6958 ]

After loading i10nm_edac (which automatically loads skx_edac_common), if
unload only i10nm_edac, then reload it and perform error injection testing,
a general protection fault may occur:

  mce: [Hardware Error]: Machine check events logged
  Oops: general protection fault ...
  ...
  Workqueue: events mce_gen_pool_process
  RIP: 0010:string+0x53/0xe0
  ...
  Call Trace:
  
  ? die_addr+0x37/0x90
  ? exc_general_protection+0x1e7/0x3f0
  ? asm_exc_general_protection+0x26/0x30
  ? string+0x53/0xe0
  vsnprintf+0x23e/0x4c0
  snprintf+0x4d/0x70
  skx_adxl_decode+0x16a/0x330 [skx_edac_common]
  skx_mce_check_error.part.0+0xf8/0x220 [skx_edac_common]
  skx_mce_check_error+0x17/0x20 [skx_edac_common]
  ...

The issue arose was because the variable 'adxl_component_count' (inside
skx_edac_common), which counts the ADXL components, was not reset. During
the reloading of i10nm_edac, the count was incremented by the actual number
of ADXL components again, resulting in a count that was double the real
number of ADXL components. This led to an out-of-bounds reference to the
ADXL component array, causing the general protection fault above.

Fix this issue by resetting the 'adxl_component_count' in adxl_put(),
which is called during the unloading of {skx,i10nm}_edac.

Fixes: 123b15863550 ("EDAC, i10nm: make skx_common.o a separate module")
Reported-by: Feng Xu 
Signed-off-by: Qiuxu Zhuo 
Signed-off-by: Tony Luck 
Tested-by: Feng Xu 
Link: https://lore.kernel.org/r/20250417150724.1170168-2-qiuxu.zhuo@intel.com
Signed-off-by: Sasha Levin