linux-stable.git/arch/x86/kernel/cpu/mce, branch linux-5.0.y

x86/mce: Handle varying MCA bank counts

2019-05-31T13:45:16+00:00

[ Upstream commit 006c077041dc73b9490fffc4c6af5befe0687110 ]

Linux reads MCG_CAP[Count] to find the number of MCA banks visible to a
CPU. Currently, this number is the same for all CPUs and a warning is
shown if there is a difference. The number of banks is overwritten with
the MCG_CAP[Count] value of each following CPU that boots.

According to the Intel SDM and AMD APM, the MCG_CAP[Count] value gives
the number of banks that are available to a "processor implementation".
The AMD BKDGs/PPRs further clarify that this value is per core. This
value has historically been the same for every core in the system, but
that is not an architectural requirement.

Future AMD systems may have different MCG_CAP[Count] values per core,
so the assumption that all CPUs will have the same MCG_CAP[Count] value
will no longer be valid.

Also, the first CPU to boot will allocate the struct mce_banks[] array
using the number of banks based on its MCG_CAP[Count] value. The machine
check handler and other functions use the global number of banks to
iterate and index into the mce_banks[] array. So it's possible to use an
out-of-bounds index on an asymmetric system where a following CPU sees an
MCG_CAP[Count] value greater than its predecessors.

Thus, allocate the mce_banks[] array to the maximum number of banks.
This will avoid the potential out-of-bounds index since the value of
mca_cfg.banks is capped to MAX_NR_BANKS.

Set the value of mca_cfg.banks equal to the max of the previous value
and the value for the current CPU. This way mca_cfg.banks will always
represent the max number of banks detected on any CPU in the system.

This will ensure that all CPUs will access all the banks that are
visible to them. A CPU that can access fewer than the max number of
banks will find the registers of the extra banks to be read-as-zero.
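
The max-and-cap rule described above can be sketched in user space as
follows. This is an illustration only, not the patch itself: the variable
max_banks_seen stands in for mca_cfg.banks, the helper name
record_cpu_banks() is hypothetical, and both the MAX_NR_BANKS value of 32
and the 8-bit width of the Count field are assumptions.

```c
#include <stdint.h>

#define MAX_NR_BANKS	32	/* assumed cap, for illustration */
#define MCG_COUNT_MASK	0xffu	/* assumed: Count is the low 8 bits of MCG_CAP */

/* Hypothetical stand-in for mca_cfg.banks: largest bank count seen so far. */
static unsigned int max_banks_seen;

/* Fold one CPU's MCG_CAP value into the running system-wide maximum. */
static unsigned int record_cpu_banks(uint64_t mcg_cap)
{
	unsigned int banks = (unsigned int)(mcg_cap & MCG_COUNT_MASK);

	/* Never exceed the size of the statically sized bank array. */
	if (banks > MAX_NR_BANKS)
		banks = MAX_NR_BANKS;

	/* Keep the maximum instead of overwriting with the latest CPU's value. */
	if (banks > max_banks_seen)
		max_banks_seen = banks;

	return max_banks_seen;
}
```

With this rule, a CPU that reports fewer banks than an earlier one no
longer shrinks the global count, and a CPU that reports more can never
push the count past the array size.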

Furthermore, print the resulting number of MCA banks in use. Do this in
mcheck_late_init() so that the final value is printed after all CPUs
have been initialized.

Finally, get the bank count from the target CPU when doing injection with
the mce-inject module.

 [ bp: Remove out-of-bounds example, passify and cleanup commit message. ]

Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: linux-edac 
Cc: Pu Wen 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vishal Verma 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20180727214009.78289-1-Yazen.Ghannam@amd.com
Signed-off-by: Sasha Levin

x86/mce: Fix machine_check_poll() tests for error types

2019-05-31T13:45:15+00:00

[ Upstream commit f19501aa07f18268ab14f458b51c1c6b7f72a134 ]

There has been a lurking "TBD" in the machine check poll routine ever
since it was first split out from the machine check handler. The
potential issue is that the poll routine may have just begun a read from
the STATUS register in a machine check bank when the hardware logs an
error in that bank and signals a machine check.

That race used to be pretty small back when machine checks were
broadcast, but the addition of local machine check means that the poll
code could continue running and clear the error from the bank before the
local machine check handler on another CPU gets around to reading it.

Fix the code to be sure to only process errors that need to be processed
in the poll code, leaving other logged errors alone for the machine
check handler to find and process.
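
One plausible reading of "errors that need to be processed in the poll
code" is sketched below. It is not the code from the patch. The function
name poll_should_log() and its two flags are hypothetical: log_everything
models a caller that wants every valid bank logged, and ser models a CPU
that supports software error recovery. The bit positions are the
architectural MCi_STATUS ones (VAL 63, UC 61, EN 60, PCC 57, S 56).

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_VAL	(1ULL << 63)	/* register contents are valid */
#define STATUS_UC	(1ULL << 61)	/* uncorrected error */
#define STATUS_EN	(1ULL << 60)	/* error reporting was enabled */
#define STATUS_PCC	(1ULL << 57)	/* processor context corrupt */
#define STATUS_S	(1ULL << 56)	/* signaled via machine check */

/* Return true if the poller should log and clear this bank. */
static bool poll_should_log(uint64_t status, bool log_everything, bool ser)
{
	if (!(status & STATUS_VAL))
		return false;

	/* Corrected errors are always the poller's job. */
	if (log_everything || !(status & STATUS_UC))
		return true;

	/* Uncorrected error without recovery support: leave it alone. */
	if (!ser)
		return false;

	/* "Not enabled" errors never raise a machine check, so log them. */
	if (!(status & STATUS_EN))
		return true;

	/* Uncorrected, but neither signaled nor context-corrupting. */
	if (!(status & (STATUS_PCC | STATUS_S)))
		return true;

	/* Anything else belongs to the machine check handler. */
	return false;
}
```

Under these assumptions, a status with UC, EN and S all set is skipped by
the poller, so the bank is still intact when the handler reads it.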

 [ bp: Massage a bit and flip the "== 0" check to the usual !(..) test. ]

Fixes: b79109c3bbcf ("x86, mce: separate correct machine check poller and fatal exception handler")
Fixes: ed7290d0ee8f ("x86, mce: implement new status bits")
Reported-by: Ashok Raj 
Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov 
Cc: Ashok Raj 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: linux-edac 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Yazen Ghannam 
Link: https://lkml.kernel.org/r/20190312170938.GA23035@agluck-desk
Signed-off-by: Sasha Levin

x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models

2019-05-22T05:38:37+00:00

commit 71a84402b93e5fbd8f817f40059c137e10171788 upstream.

AMD family 17h Models 10h-2Fh may report a high number of L1 BTB MCA
errors under certain conditions. The errors are benign and can safely be
ignored. However, the high error rate may cause the MCA threshold
counter to overflow, causing a high rate of thresholding interrupts.

In addition, users may see the errors reported through the AMD MCE
decoder module, even with the interrupt disabled, due to MCA polling.

Clear the "Counter Present" bit in the Instruction Fetch bank's
MCA_MISC0 register. This prevents MCA thresholding from being enabled on
this bank, which in turn prevents the high interrupt rate caused by this
error.
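
The effect of clearing that bit can be modelled on a plain 64-bit value.
This is a sketch under stated assumptions, not the patch: the helper
names are hypothetical, and the bit positions (Valid at bit 63, Counter
Present at bit 62) are assumed here rather than taken from this commit
message. Real code would read and write the MSR instead of a variable.

```c
#include <stdbool.h>
#include <stdint.h>

#define MISC_VALID	(1ULL << 63)	/* assumed position of the Valid bit */
#define MISC_CNTP	(1ULL << 62)	/* assumed position of Counter Present */

/* Return the register value with "Counter Present" cleared. */
static uint64_t clear_counter_present(uint64_t misc0)
{
	return misc0 & ~MISC_CNTP;
}

/* Thresholding setup is only attempted when a valid counter is present. */
static bool thresholding_possible(uint64_t misc0)
{
	return (misc0 & MISC_VALID) && (misc0 & MISC_CNTP);
}
```

Once the bit is cleared, a setup path that checks for a present counter
skips the bank, so no thresholding interrupt is ever armed for it.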

Define an AMD-specific function to filter these errors from the MCE
event pool so that they don't get reported during early boot.

While at it, rename the filter function in EDAC/mce_amd to avoid a naming
conflict.

 [ bp: Move function prototype to the internal header and
   massage/cleanup, fix typos. ]

Reported-by: Rafał Miłecki 
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: "clemej@gmail.com" 
Cc: Arnd Bergmann 
Cc: Ingo Molnar 
Cc: James Morse 
Cc: Kees Cook 
Cc: Mauro Carvalho Chehab 
Cc: Pu Wen 
Cc: Qiuxu Zhuo 
Cc: Shirish S 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vishal Verma 
Cc: linux-edac 
Cc: x86-ml 
Cc:  # 5.0.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
Cc:  # 5.0.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
Cc:  # 5.0.x: 9308fd407455: x86/MCE: Group AMD function prototypes in 
Cc:  # 5.0.x
Link: https://lkml.kernel.org/r/20190325163410.171021-2-Yazen.Ghannam@amd.com
Signed-off-by: Greg Kroah-Hartman

x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk

2019-05-22T05:38:37+00:00

commit 30aa3d26edb0f3d7992757287eec0ca588a5c259 upstream.

The MC4_MISC thresholding quirk needs to be applied during S5 -> S0 and
S3 -> S0 state transitions, which follow different code paths. Carve it
out into a separate function and call it from mce_amd_feature_init(),
where the two code paths of the state transitions converge.

 [ bp: massage commit message and the carved out function. ]

Signed-off-by: Shirish S 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vishal Verma 
Cc: Yazen Ghannam 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/1547651417-23583-3-git-send-email-shirish.s@amd.com
Signed-off-by: Greg Kroah-Hartman

x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models

2019-05-22T05:38:36+00:00

commit c95b323dcd3598dd7ef5005d6723c1ba3b801093 upstream.

MC4_MISC thresholding is not supported on any family 0x15 processor,
hence skip the x86_model check when applying the quirk.

 [ bp: massage commit message. ]

Signed-off-by: Shirish S 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vishal Verma 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/1547106849-3476-2-git-send-email-shirish.s@amd.com
Signed-off-by: Greg Kroah-Hartman

x86/MCE: Add an MCE-record filtering function

2019-05-22T05:38:36+00:00

commit 45d4b7b9cb88526f6d5bd4c03efab88d75d10e4f upstream.

Some systems may report spurious MCA errors. In general, spurious MCA
errors may be disabled by clearing a particular bit in MCA_CTL. However,
clearing a bit in MCA_CTL may not be recommended for some errors, so the
only option is to ignore them.

An MCA error is printed and handled after it has been added to the MCE
event pool. So an MCA error can be ignored by not adding it to that pool
in the first place.

Add such a filtering function.
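
The idea of filtering before the pool can be sketched in user space as
below. Everything here is hypothetical and only illustrates the control
flow: struct mce_record is a cut-down stand-in for the kernel's record,
the pool is a fixed array, and the rule in record_is_spurious() is made
up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down record; the real one carries many more fields. */
struct mce_record {
	unsigned int bank;
	uint64_t status;
};

#define POOL_SIZE 8

static struct mce_record pool[POOL_SIZE];
static size_t pool_len;

/* Made-up rule: treat error code 0x0001 reported by bank 1 as spurious. */
static bool record_is_spurious(const struct mce_record *m)
{
	return m->bank == 1 && (m->status & 0xffff) == 0x0001;
}

/* Returns 0 if the record was queued, -1 if it was filtered or the pool is full. */
static int pool_add(const struct mce_record *m)
{
	/* A filtered record never enters the pool, so it is never printed. */
	if (record_is_spurious(m))
		return -1;

	if (pool_len == POOL_SIZE)
		return -1;

	pool[pool_len++] = *m;
	return 0;
}
```

Putting the check at the single entry point of the pool means no later
consumer has to know about the spurious error at all.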

 [ bp: Move function prototype to the internal header and massage. ]

Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov 
Cc: Arnd Bergmann 
Cc: "clemej@gmail.com" 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Pu Wen 
Cc: Qiuxu Zhuo 
Cc: "rafal@milecki.pl" 
Cc: Shirish S 
Cc:  # 5.0.x
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vishal Verma 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20190325163410.171021-1-Yazen.Ghannam@amd.com
Signed-off-by: Greg Kroah-Hartman

x86/mce: Improve error message when kernel cannot recover, p2

2019-05-08T05:22:59+00:00

commit 41f035a86b5b72a4f947c38e94239d20d595352a upstream.

In

  c7d606f560e4 ("x86/mce: Improve error message when kernel cannot recover")

a case was added for a machine check caused by a DATA access to poisoned
memory from the kernel. A case should also have been added for an
uncorrectable error during an instruction fetch in the kernel.

Add that extra case so the error message now reads:

  mce: [Hardware Error]: Machine check: Instruction fetch error in kernel

Fixes: c7d606f560e4 ("x86/mce: Improve error message when kernel cannot recover")
Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Pu Wen 
Cc: Thomas Gleixner 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20190225205940.15226-1-tony.luck@intel.com
Signed-off-by: Greg Kroah-Hartman

x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()

2019-02-03T12:24:24+00:00

Internal injection testing crashed with a console log that said:

  mce: [Hardware Error]: CPU 7: Machine Check Exception: f Bank 0: bd80000000100134

This caused a lot of head scratching because the MCACOD (bits 15:0) of
that status is a signature from an L1 data cache error. But Linux says
that it found it in "Bank 0", which on this model CPU only reports L1
instruction cache errors.
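
The status value from the log line can be picked apart by hand. The
helpers below are a small sketch with hypothetical names; the MCACOD
field position (bits 15:0) comes from the text above, and the flag
positions mentioned in the comments are the architectural MCi_STATUS
ones.

```c
#include <stdint.h>

/* MCACOD is the low 16 bits of the status register. */
static unsigned int status_mcacod(uint64_t status)
{
	return (unsigned int)(status & 0xffff);
}

/* Return a single status bit, e.g. 63 (VAL), 61 (UC), 57 (PCC). */
static int status_bit(uint64_t status, unsigned int bit)
{
	return (int)((status >> bit) & 1);
}
```

For the logged value 0xbd80000000100134, status_mcacod() yields 0x0134,
and status_bit() reports VAL (bit 63) and UC (bit 61) as set and PCC
(bit 57) as clear.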

The answer was that Linux doesn't initialize "m->bank" in the case that
it finds a fatal error in the mce_no_way_out() pre-scan of banks. If
this was a local machine check, then this partially initialized struct
mce is being passed to mce_panic().

The fix is simple: just initialize m->bank in the case of a fatal error.

Fixes: 40c36e2741d7 ("x86/mce: Fix incorrect "Machine check from unknown source" message")
Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Vishal Verma 
Cc: x86-ml 
Cc: stable@vger.kernel.org # v4.18 Note pre-v5.0 arch/x86/kernel/cpu/mce/core.c was called arch/x86/kernel/cpu/mcheck/mce.c
Link: https://lkml.kernel.org/r/20190201003341.10638-1-tony.luck@intel.com

Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2018-12-27T01:03:51+00:00

Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kprobes: Remove trampoline_handler() prototype
  x86/kernel: Fix more -Wmissing-prototypes warnings
  x86: Fix various typos in comments
  x86/headers: Fix -Wmissing-prototypes warning
  x86/process: Avoid unnecessary NULL check in get_wchan()
  x86/traps: Complete prototype declarations
  x86/mce: Fix -Wmissing-prototypes warnings
  x86/gart: Rewrite early_gart_iommu_check() comment

x86/mce: Restore MCE injector's module name

2018-12-18T23:04:36+00:00

It was mce-inject.ko but it turned into inject.ko since the containing
source file got renamed. Restore it.

Fixes: 21afaf181362 ("x86/mce: Streamline MCE subsystem's naming")
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Cc: linux-edac 
Cc: Tony Luck 
Link: https://lkml.kernel.org/r/20181218182546.GA21386@zn.tnic