linux-stable.git/arch/x86/kernel/cpu/mcheck, branch linux-4.9.y

x86/mce/amd: Fix kobject lifetime

2020-02-28T14:42:48+00:00

commit 51dede9c05df2b78acd6dcf6a17d21f0877d2d7b upstream.

Accessing the MCA thresholding controls in sysfs concurrently with CPU
hotplug can lead to a couple of KASAN-reported issues:

  BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180
  Read of size 8 at addr ffff888367578940 by task grep/4019

and

  BUG: KASAN: use-after-free in show_error_count+0x15c/0x180
  Read of size 2 at addr ffff888368a05514 by task grep/4454

for example. Both result from the fact that the threshold block
creation/teardown code frees the descriptor memory itself instead of
defining proper ->release function and leaving it to the driver core to
take care of that, after all sysfs accesses have completed.

Do that and get rid of the custom freeing code, fixing the above UAFs in
the process.

  [ bp: write commit message. ]

Fixes: 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors")
Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov 
Cc: 
Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de
Signed-off-by: Greg Kroah-Hartman

x86/mce/amd: Publish the bank pointer only after setup has succeeded

2020-02-28T14:42:47+00:00

commit 6e5cf31fbe651bed7ba1df768f2e123531132417 upstream.

threshold_create_bank() creates a bank descriptor per MCA error
thresholding counter which can be controlled over sysfs. It publishes
the pointer to that bank in a per-CPU variable and then goes on to
create additional thresholding blocks if the bank has such.

However, that creation of additional blocks in
allocate_threshold_blocks() can fail, leading to a use-after-free
through the per-CPU pointer.

Therefore, publish that pointer only after all blocks have been setup
successfully.

Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor")
Reported-by: Saar Amar 
Reported-by: Dan Carpenter 
Signed-off-by: Borislav Petkov 
Cc: 
Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain
Signed-off-by: Greg Kroah-Hartman

x86/mce: Fix possibly incorrect severity calculation on AMD

2020-01-04T12:41:04+00:00

[ Upstream commit a3a57ddad061acc90bef39635caf2b2330ce8f21 ]

The function mce_severity_amd_smca() requires m->bank to be initialized
for correct operation. Fix the one case, where mce_severity() is called
without doing so.

Fixes: 6bda529ec42e ("x86/mce: Grade uncorrected errors for SMCA-enabled systems")
Fixes: d28af26faa0b ("x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()")
Signed-off-by: Jan H. Schönherr 
Signed-off-by: Borislav Petkov 
Reviewed-by: Tony Luck 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: linux-edac 
Cc: 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Yazen Ghannam 
Link: https://lkml.kernel.org/r/20191210000733.17979-4-jschoenh@amazon.de
Signed-off-by: Sasha Levin

x86/mce: Lower throttling MCE messages' priority to warning

2020-01-04T12:39:27+00:00

[ Upstream commit 9c3bafaa1fd88e4dd2dba3735a1f1abb0f2c7bb7 ]

On modern CPUs it is quite normal that the temperature limits are
reached and the CPU is throttled. In fact, often the thermal design is
not sufficient to cool the CPU at full load and limits can quickly be
reached when a burst in load happens. This will even happen with
technologies like RAPL limitting the long term power consumption of
the package.

Also, these limits are "softer", as Srinivas explains:

"CPU temperature doesn't have to hit max(TjMax) to get these warnings.
OEMs ha[ve] an ability to program a threshold where a thermal interrupt
can be generated. In some systems the offset is 20C+ (Read only value).

In recent systems, there is another offset on top of it which can be
programmed by OS, once some agent can adjust power limits dynamically.
By default this is set to low by the firmware, which I guess the
prime motivation of Benjamin to submit the patch."

So these messages do not usually indicate a hardware issue (e.g.
insufficient cooling). Log them as warnings to avoid confusion about
their severity.

 [ bp: Massage commit mesage. ]

Signed-off-by: Benjamin Berg 
Signed-off-by: Borislav Petkov 
Reviewed-by: Hans de Goede 
Tested-by: Christian Kellner 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: linux-edac 
Cc: Peter Zijlstra 
Cc: Srinivas Pandruvada 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20191009155424.249277-1-bberg@redhat.com
Signed-off-by: Sasha Levin

x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk

2019-12-21T09:42:14+00:00

[ Upstream commit 30aa3d26edb0f3d7992757287eec0ca588a5c259 ]

The MC4_MISC thresholding quirk needs to be applied during S5 -> S0 and
S3 -> S0 state transitions, which follow different code paths. Carve it
out into a separate function and call it mce_amd_feature_init() where
the two code paths of the state transitions converge.

 [ bp: massage commit message and the carved out function. ]

Signed-off-by: Shirish S 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vishal Verma 
Cc: Yazen Ghannam 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/1547651417-23583-3-git-send-email-shirish.s@amd.com
Signed-off-by: Sasha Levin

x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models

2019-12-21T09:42:14+00:00

[ Upstream commit c95b323dcd3598dd7ef5005d6723c1ba3b801093 ]

MC4_MISC thresholding is not supported on all family 0x15 processors,
hence skip the x86_model check when applying the quirk.

 [ bp: massage commit message. ]

Signed-off-by: Shirish S 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vishal Verma 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/1547106849-3476-2-git-send-email-shirish.s@amd.com
Signed-off-by: Sasha Levin

x86/mce: Fix machine_check_poll() tests for error types

2019-05-31T13:48:30+00:00

[ Upstream commit f19501aa07f18268ab14f458b51c1c6b7f72a134 ]

There has been a lurking "TBD" in the machine check poll routine ever
since it was first split out from the machine check handler. The
potential issue is that the poll routine may have just begun a read from
the STATUS register in a machine check bank when the hardware logs an
error in that bank and signals a machine check.

That race used to be pretty small back when machine checks were
broadcast, but the addition of local machine check means that the poll
code could continue running and clear the error from the bank before the
local machine check handler on another CPU gets around to reading it.

Fix the code to be sure to only process errors that need to be processed
in the poll code, leaving other logged errors alone for the machine
check handler to find and process.

 [ bp: Massage a bit and flip the "== 0" check to the usual !(..) test. ]

Fixes: b79109c3bbcf ("x86, mce: separate correct machine check poller and fatal exception handler")
Fixes: ed7290d0ee8f ("x86, mce: implement new status bits")
Reported-by: Ashok Raj 
Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov 
Cc: Ashok Raj 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: linux-edac 
Cc: Thomas Gleixner 
Cc: x86-ml 
Cc: Yazen Ghannam 
Link: https://lkml.kernel.org/r/20190312170938.GA23035@agluck-desk
Signed-off-by: Sasha Levin

x86/MCE: Save microcode revision in machine check records

2019-05-14T17:19:32+00:00

commit fa94d0c6e0f3431523f5701084d799c77c7d4a4f upstream.

Updating microcode used to be relatively rare. Now that it has become
more common we should save the microcode version in a machine check
record to make sure that those people looking at the error have this
important information bundled with the rest of the logged information.

[ Borislav: Simplify a bit. ]

Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Cc: Yazen Ghannam 
Cc: linux-edac 
Link: http://lkml.kernel.org/r/20180301233449.24311-1-tony.luck@intel.com
[bwh: Backported to 4.9:
 - Also add ppin field to struct mce, to match upstream UAPI
 - Adjust context]
Signed-off-by: Ben Hutchings 
Signed-off-by: Greg Kroah-Hartman

x86/mce: Improve error message when kernel cannot recover, p2

2019-05-08T05:19:12+00:00

commit 41f035a86b5b72a4f947c38e94239d20d595352a upstream.

In

  c7d606f560e4 ("x86/mce: Improve error message when kernel cannot recover")

a case was added for a machine check caused by a DATA access to poison
memory from the kernel. A case should have been added also for an
uncorrectable error during an instruction fetch in the kernel.

Add that extra case so the error message now reads:

  mce: [Hardware Error]: Machine check: Instruction fetch error in kernel

Fixes: c7d606f560e4 ("x86/mce: Improve error message when kernel cannot recover")
Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Pu Wen 
Cc: Thomas Gleixner 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20190225205940.15226-1-tony.luck@intel.com
Signed-off-by: Greg Kroah-Hartman

x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()

2019-02-12T18:45:01+00:00

commit d28af26faa0b1daf3c692603d46bc4687c16f19e upstream.

Internal injection testing crashed with a console log that said:

  mce: [Hardware Error]: CPU 7: Machine Check Exception: f Bank 0: bd80000000100134

This caused a lot of head scratching because the MCACOD (bits 15:0) of
that status is a signature from an L1 data cache error. But Linux says
that it found it in "Bank 0", which on this model CPU only reports L1
instruction cache errors.

The answer was that Linux doesn't initialize "m->bank" in the case that
it finds a fatal error in the mce_no_way_out() pre-scan of banks. If
this was a local machine check, then this partially initialized struct
mce is being passed to mce_panic().

Fix is simple: just initialize m->bank in the case of a fatal error.

Fixes: 40c36e2741d7 ("x86/mce: Fix incorrect "Machine check from unknown source" message")
Signed-off-by: Tony Luck 
Signed-off-by: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Vishal Verma 
Cc: x86-ml 
Cc: stable@vger.kernel.org # v4.18 Note pre-v5.0 arch/x86/kernel/cpu/mce/core.c was called arch/x86/kernel/cpu/mcheck/mce.c
Link: https://lkml.kernel.org/r/20190201003341.10638-1-tony.luck@intel.com
Signed-off-by: Greg Kroah-Hartman