linux.git/arch/x86/kernel/cpu/mcheck, branch v3.10

Merge tag 'please-pull-cmci_rediscover' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras

2013-04-08T15:41:50+00:00

Pull clean up of the cmci_rediscover code to fix problems found by Dave Jones,
from Tony Luck.

Signed-off-by: Ingo Molnar

x86/mce: Rework cmci_rediscover() to play well with CPU hotplug

2013-04-02T21:04:01+00:00

Dave Jones reports that offlining a CPU leads to this trace:

numa_remove_cpu cpu 1 node 0: mask now 0,2-3
smpboot: CPU 1 is now offline
BUG: using smp_processor_id() in preemptible [00000000] code: cpu-offline.sh/10591
caller is cmci_rediscover+0x6a/0xe0
Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
Call Trace:
 [] debug_smp_processor_id+0xdd/0x100
 [] cmci_rediscover+0x6a/0xe0
 [] mce_cpu_callback+0x19d/0x1ae
 [] notifier_call_chain+0x66/0x150
 [] __raw_notifier_call_chain+0xe/0x10
 [] cpu_notify+0x23/0x50
 [] cpu_notify_nofail+0xe/0x20
 [] _cpu_down+0x302/0x350
 [] cpu_down+0x36/0x50
 [] store_online+0x8d/0xd0
 [] dev_attr_store+0x18/0x30
 [] sysfs_write_file+0xdb/0x150
 [] vfs_write+0xa2/0x170
 [] sys_write+0x4c/0xa0
 [] system_call_fastpath+0x16/0x1b

However, a look at cmci_rediscover shows that it can be simplified quite
a bit, apart from solving the above issue. It invokes functions that
take spin locks with interrupts disabled, and hence it can run in atomic
context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU
is already dead and out of the cpu_online_mask. So take these points into
account and simplify the code, and thereby also fix the above issue.
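
The point about CPU_POST_DEAD can be illustrated with a small userspace model.
This is only a sketch, not the kernel code: NR_MODEL_CPUS, rediscover_calls,
rediscover_on_cpu() and model_cmci_rediscover() are hypothetical names, and a
plain bitmask stands in for cpu_online_mask. Because the dead CPU has already
left the mask, walking the online CPUs needs no special case to skip it.

```c
#include <stdint.h>

#define NR_MODEL_CPUS 4                 /* hypothetical CPU count for the model */

static int rediscover_calls[NR_MODEL_CPUS];

/* Stand-in for the per-CPU bank rediscovery work. */
static void rediscover_on_cpu(int cpu)
{
    rediscover_calls[cpu]++;
}

/*
 * Model of the simplified flow: visit every CPU still present in the
 * online mask. The offlined CPU is already absent from the mask, so it
 * is never visited and needs no explicit exclusion.
 */
static void model_cmci_rediscover(uint32_t online_mask)
{
    for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
        if (online_mask & (1u << cpu))
            rediscover_on_cpu(cpu);
    }
}
```

With the mask from the trace above (CPUs 0,2-3 online after CPU 1 goes down),
calling model_cmci_rediscover(0xd) touches CPUs 0, 2 and 3 only.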

Reported-by: Dave Jones 
Signed-off-by: Srivatsa S. Bhat 
Signed-off-by: Tony Luck

x86, MCE, AMD: Use MCG_CAP MSR to find out number of banks on AMD

2013-03-22T10:25:01+00:00

Currently the number of error reporting register banks is hardcoded to
6 on AMD processors. This may break in virtualized scenarios when
a hypervisor prefers to report fewer banks than the physical
HW provides.

Since the number of supported banks is reported in MSR_IA32_MCG_CAP[7:0],
that is what we should use.
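
As a rough illustration of reading the bank count from bits 7:0, the sketch
below extracts the field from a raw MCG_CAP value. It is a userspace
approximation: the helper name mcg_cap_bank_count() is hypothetical, and the
value is passed in as a parameter instead of being read with rdmsr.

```c
#include <stdint.h>

#define MCG_BANKCNT_MASK 0xffULL        /* bits 7:0 of MSR_IA32_MCG_CAP */

/* Return the number of MCA banks encoded in a raw MCG_CAP value. */
static unsigned int mcg_cap_bank_count(uint64_t mcg_cap)
{
    return (unsigned int)(mcg_cap & MCG_BANKCNT_MASK);
}
```

A hypervisor that exposes only four banks would report 4 in this field even
if the physical part has six, which is the case the hardcoded value misses.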

Signed-off-by: Boris Ostrovsky 
Link: http://lkml.kernel.org/r/1363295441-1859-3-git-send-email-boris.ostrovsky@oracle.com
[ reverse NULL ptr test logic ]
Signed-off-by: Borislav Petkov

x86, MCE, AMD: Replace shared_bank array with is_shared_bank() helper

2013-03-22T10:25:01+00:00

Use a helper function instead of an array to report whether a register
bank is shared. Currently only bank 4 (northbridge) is shared.
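
A helper of this kind could look roughly like the following. The function
name comes from the subject line; the body is a sketch based only on the
statement that bank 4 is the sole shared bank, not a copy of the patch.

```c
#include <stdbool.h>

/* Report whether an MCA bank is shared between cores. */
static bool is_shared_bank(int bank)
{
    /* Per the description, only bank 4 (northbridge) is shared. */
    return bank == 4;
}
```

Compared with a lookup array, the helper has no size to keep in sync with the
bank count, which matters once that count is no longer a compile-time constant.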

Signed-off-by: Boris Ostrovsky 
Link: http://lkml.kernel.org/r/1363295441-1859-2-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Borislav Petkov

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

2013-02-25T23:41:43+00:00

Pull module update from Rusty Russell:
 "The sweeping change is to make add_taint() explicitly indicate whether
  to disable lockdep, but it's a mechanical change."

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  MODSIGN: Add option to not sign modules during modules_install
  MODSIGN: Add -s <signature> option to sign-file
  MODSIGN: Specify the hash algorithm on sign-file command line
  MODSIGN: Simplify Makefile with a Kconfig helper
  module: clean up load_module a little more.
  modpost: Ignore ARC specific non-alloc sections
  module: constify within_module_*
  taint: add explicit flag to show whether lock dep is still OK.
  module: printk message when module signature fail taints kernel.

taint: add explicit flag to show whether lock dep is still OK.

2013-01-21T06:47:57+00:00

Fix up all callers to behave as they did before, with one change: an
unsigned module taints the kernel, but doesn't turn off lockdep.
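
The idea of an explicit lockdep flag can be modelled in userspace as below.
This is a sketch under assumptions: the enum mirrors the two states the
change describes, while model_add_taint(), the MODEL_TAINT_* values,
tainted_mask and lockdep_enabled are hypothetical stand-ins for kernel state.

```c
#include <stdbool.h>

/* Whether lock debugging can still be trusted after this taint. */
enum lockdep_ok { LOCKDEP_STILL_OK, LOCKDEP_NOW_UNRELIABLE };

/* Hypothetical taint bit numbers, for illustration only. */
enum { MODEL_TAINT_UNSIGNED_MODULE = 0, MODEL_TAINT_OTHER = 1 };

static unsigned long tainted_mask;
static bool lockdep_enabled = true;

/* Record a taint; the caller states explicitly whether lockdep survives. */
static void model_add_taint(unsigned int flag, enum lockdep_ok ok)
{
    if (ok == LOCKDEP_NOW_UNRELIABLE)
        lockdep_enabled = false;
    tainted_mask |= 1UL << flag;
}
```

In this model, tainting for an unsigned module with LOCKDEP_STILL_OK sets the
taint bit but leaves lockdep_enabled true, matching the one behaviour change
described above.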

Signed-off-by: Rusty Russell

x86/mce: don't use [delayed_]work_pending()

2012-12-28T21:40:16+00:00

There's no need to test whether a (delayed) work item is pending
before queueing, flushing or cancelling it.  Most uses are unnecessary
and quite a few of them are buggy.

Remove unnecessary pending tests from x86/mce.  Only compile tested.
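
Why the explicit test is redundant can be seen in a small userspace model of
queueing. It is a sketch, not the workqueue implementation: struct model_work,
model_queue_work() and model_run_work() are hypothetical, and an atomic_flag
stands in for the pending bit that the queueing path already tests and sets
atomically.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_work {
    atomic_flag pending;    /* stands in for the work item's pending bit */
    int queued_count;       /* how many times the item was really queued */
};

/*
 * Queue the item unless it is already pending. The test-and-set is atomic,
 * so a separate "is it pending?" check beforehand adds nothing and would
 * be racy on its own.
 */
static bool model_queue_work(struct model_work *w)
{
    if (atomic_flag_test_and_set(&w->pending))
        return false;       /* already pending: nothing to do */
    w->queued_count++;
    return true;
}

/* Model the worker picking the item up, which clears the pending state. */
static void model_run_work(struct model_work *w)
{
    atomic_flag_clear(&w->pending);
}
```

Calling model_queue_work() twice in a row queues the item once; the second
call simply returns false, which is the behaviour an open-coded pending test
tries to reproduce less safely.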

v2: Local var work removed from mce_schedule_work() as suggested by
    Borislav.

Signed-off-by: Tejun Heo 
Acked-by: Borislav Petkov 
Cc: Tony Luck 
Cc: linux-edac@vger.kernel.org

Merge branch 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2012-12-14T17:59:59+00:00

Pull x86 RAS update from Ingo Molnar:
 "Rework all config variables used throughout the MCA code and collect
  them together into a mca_config struct.  This keeps them tightly and
  neatly packed together instead of spilled all over the place.

  Then, convert those which are used as booleans into real booleans and
  save some space.  These bits are exposed via
     /sys/devices/system/machinecheck/machinecheck*/"

* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, MCA: Finish mca_config conversion
  x86, MCA: Convert the next three variables batch
  x86, MCA: Convert rip_msr, mce_bootlog, monarch_timeout
  x86, MCA: Convert dont_log_ce, banks and tolerant
  drivers/base: Add a DEVICE_BOOL_ATTR macro

Merge tag 'please-pull-tangchen' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/urgent

2012-11-13T18:01:01+00:00

Pull MCE fix from Tony Luck:

   "Fix problem in CMCI rediscovery code that was illegally
    migrating worker threads to other cpus."

Signed-off-by: Ingo Molnar

x86/mce: Do not change worker's running cpu in cmci_rediscover().

2012-10-30T21:38:12+00:00

cmci_rediscover() used set_cpus_allowed_ptr() to change the CPU the current
process runs on, migrating itself to the destination CPU. But worker threads
must not be migrated. If current is a worker, it is migrated to another CPU
while the corresponding worker_pool stays on the original CPU.

In this case, the following BUG_ON in try_to_wake_up_local() will be triggered:
BUG_ON(rq != this_rq());

This causes a kernel panic. The call trace looks like the following:

[ 6155.451107] ------------[ cut here ]------------
[ 6155.452019] kernel BUG at kernel/sched/core.c:1654!
......
[ 6155.452019] RIP: 0010:[]  [] try_to_wake_up_local+0x115/0x130
......
[ 6155.452019] Call Trace:
[ 6155.452019]  [] __schedule+0x764/0x880
[ 6155.452019]  [] schedule+0x29/0x70
[ 6155.452019]  [] schedule_timeout+0x235/0x2d0
[ 6155.452019]  [] ? mark_held_locks+0x8d/0x140
[ 6155.452019]  [] ? __lock_release+0x133/0x1a0
[ 6155.452019]  [] ? _raw_spin_unlock_irq+0x30/0x50
[ 6155.452019]  [] ? trace_hardirqs_on_caller+0x105/0x190
[ 6155.452019]  [] wait_for_common+0x12b/0x180
[ 6155.452019]  [] ? try_to_wake_up+0x2f0/0x2f0
[ 6155.452019]  [] wait_for_completion+0x1d/0x20
[ 6155.452019]  [] stop_one_cpu+0x8a/0xc0
[ 6155.452019]  [] ? __migrate_task+0x1a0/0x1a0
[ 6155.452019]  [] ? complete+0x28/0x60
[ 6155.452019]  [] set_cpus_allowed_ptr+0x128/0x130
[ 6155.452019]  [] cmci_rediscover+0xf5/0x140
[ 6155.452019]  [] mce_cpu_callback+0x18d/0x19d
[ 6155.452019]  [] notifier_call_chain+0x67/0x150
[ 6155.452019]  [] __raw_notifier_call_chain+0xe/0x10
[ 6155.452019]  [] __cpu_notify+0x20/0x40
[ 6155.452019]  [] cpu_notify_nofail+0x15/0x30
[ 6155.452019]  [] _cpu_down+0x262/0x2e0
[ 6155.452019]  [] cpu_down+0x36/0x50
[ 6155.452019]  [] acpi_processor_remove+0x50/0x11e
[ 6155.452019]  [] acpi_device_remove+0x90/0xb2
[ 6155.452019]  [] __device_release_driver+0x7c/0xf0
[ 6155.452019]  [] device_release_driver+0x2f/0x50
[ 6155.452019]  [] acpi_bus_remove+0x32/0x6d
[ 6155.452019]  [] acpi_bus_trim+0x87/0xee
[ 6155.452019]  [] acpi_bus_hot_remove_device+0x88/0x16b
[ 6155.452019]  [] acpi_os_execute_deferred+0x27/0x34
[ 6155.452019]  [] process_one_work+0x219/0x680
[ 6155.452019]  [] ? process_one_work+0x1b8/0x680
[ 6155.452019]  [] ? acpi_os_wait_events_complete+0x23/0x23
[ 6155.452019]  [] worker_thread+0x12e/0x320
[ 6155.452019]  [] ? manage_workers+0x110/0x110
[ 6155.452019]  [] kthread+0xc6/0xd0
[ 6155.452019]  [] kernel_thread_helper+0x4/0x10
[ 6155.452019]  [] ? retint_restore_args+0x13/0x13
[ 6155.452019]  [] ? __init_kthread_worker+0x70/0x70
[ 6155.452019]  [] ? gs_change+0x13/0x13

This patch removes the set_cpus_allowed_ptr() call and instead queues the CMCI
rediscover jobs on all the other CPUs using system_wq. This may delay the jobs
somewhat.

Signed-off-by: Tang Chen 
Signed-off-by: Miao Xie 
Signed-off-by: Tony Luck