linux-stable.git/kernel, branch linux-4.6.y

cpu/hotplug: Keep enough storage space if SMP=n to avoid array out of bounds scribble

2016-08-10T10:54:49+00:00

commit a7c734140aa36413944eef0f8c660e0e2256357d upstream.

Xiaolong Ye reported lock debug warnings triggered by the following commit:

  8de4a0066106 ("perf/x86: Convert the core to the hotplug state machine")

The bug is the following: the cpuhp_bp_states[] array is cut short when
CONFIG_SMP=n, but the dynamically registered callbacks are stored nevertheless
and happily scribble outside of the array bounds...

The callbacks must be stored so that the teardown function can be invoked when
the state is unregistered. That's independent of CONFIG_SMP. Make sure the array
is large enough.
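
For illustration only (this is not the code from the patch), the intended
invariant can be sketched in userspace C: the storage is sized by the enum's
sentinel value rather than by a configuration switch, so every state ID that
may be registered has a slot. All demo_* names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical state IDs; the real kernel enum is much larger. */
enum demo_state {
	DEMO_OFFLINE,
	DEMO_PREPARE_STATIC,
	DEMO_DYN_FIRST,
	DEMO_DYN_LAST = DEMO_DYN_FIRST + 3,
	DEMO_STATE_MAX,		/* sentinel: number of slots needed */
};

struct demo_step {
	const char *name;
	int (*teardown)(void);
};

/*
 * Sized by the sentinel unconditionally, so dynamically registered states
 * always have a slot, whatever the build configuration is.
 */
static struct demo_step demo_states[DEMO_STATE_MAX];

static int demo_teardown_calls;

/* Example teardown callback that only counts its invocations. */
static int demo_teardown(void)
{
	demo_teardown_calls++;
	return 0;
}

/* Store the callback; reject IDs that have no slot. */
static int demo_register(enum demo_state st, const char *name,
			 int (*teardown)(void))
{
	if ((int)st < 0 || (int)st >= DEMO_STATE_MAX)
		return -1;
	demo_states[st].name = name;
	demo_states[st].teardown = teardown;
	return 0;
}

/* Invoke the stored teardown, then clear the slot. */
static int demo_unregister(enum demo_state st)
{
	int ret = 0;

	if ((int)st < 0 || (int)st >= DEMO_STATE_MAX)
		return -1;
	if (demo_states[st].teardown)
		ret = demo_states[st].teardown();
	demo_states[st].name = NULL;
	demo_states[st].teardown = NULL;
	return ret;
}
```

Registering DEMO_DYN_LAST and unregistering it again runs the teardown exactly
once; an ID at or beyond DEMO_STATE_MAX is refused instead of written.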

Reported-by: kernel test robot 
Signed-off-by: Thomas Gleixner 
Cc: Adam Borowski 
Cc: Alexander Shishkin 
Cc: Anna-Maria Gleixner 
Cc: Arnaldo Carvalho de Melo 
Cc: Borislav Petkov 
Cc: Jiri Olsa 
Cc: Kan Liang 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Sebastian Andrzej Siewior 
Cc: Stephane Eranian 
Cc: Vince Weaver 
Cc: lkp@01.org
Cc: tipbuild@zytor.com
Fixes: cff7d378d3fd "cpu/hotplug: Convert to a state machine for the control processor"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607122144560.4083@nanos
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

posix_cpu_timer: Exit early when process has been reaped

2016-08-10T10:54:49+00:00

commit 2c13ce8f6b2f6fd9ba2f9261b1939fc0f62d1307 upstream.

The variable "now" appears to be genuinely used uninitialized
if the branch

	if (CPUCLOCK_PERTHREAD(timer->it_clock)) {

is not taken and the branch

	if (unlikely(sighand == NULL)) {

is taken. In this case the process has been reaped and the timer is marked as
disarmed anyway. So none of the postprocessing of the sample is
required. Return right away.
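
A minimal userspace sketch of the control flow, assuming hypothetical demo_*
names and a simplified timer structure (it is not the kernel function itself):
the sample variable is assigned on every path that later reads it, and the
reaped case returns before any read.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified timer; not the kernel's struct k_itimer. */
struct demo_timer {
	bool per_thread;
	bool armed;
	unsigned long long expires;
};

/*
 * Returns true and fills *remaining when a valid sample was taken.
 * Returns false, leaving *remaining untouched, when the process has
 * already been reaped (modelled here as sighand == NULL).
 */
static bool demo_timer_get(struct demo_timer *t, const void *sighand,
			   unsigned long long clock_now,
			   unsigned long long *remaining)
{
	unsigned long long now;	/* assigned on each path that reads it */

	if (t->per_thread) {
		now = clock_now;
	} else {
		if (sighand == NULL) {
			/* Reaped: mark disarmed and skip the postprocessing. */
			t->armed = false;
			t->expires = 0;
			return false;
		}
		now = clock_now;
	}

	*remaining = t->expires > now ? t->expires - now : 0;
	return true;
}
```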

Signed-off-by: Alexey Dobriyan 
Link: http://lkml.kernel.org/r/20160707223911.GA26483@p183.telecom.by
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Fix effective_load() to consistently use smoothed load

2016-08-10T10:54:48+00:00

commit 7dd4912594daf769a46744848b05bd5bc6d62469 upstream.

Starting with the following commit:

  fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

calc_tg_weight() doesn't compute the right value as expected by effective_load().

The difference is in the 'correction' term. In order to ensure \Sum
rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
lagging a correction on the current cfs_rq->avg.load_avg value.
Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
cfs_rq->avg.load_avg.

Now, per the referenced commit, calc_tg_weight() doesn't use
cfs_rq->avg.load_avg, as is later used in @w, but uses
cfs_rq->load.weight instead.

So stop using calc_tg_weight() and do it explicitly.

The effects of this bug are wake_affine() making randomly
poor choices in cgroup-intense workloads.
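
The correction term can be sketched in plain C as follows. This is a
simplified illustration with hypothetical demo_* names; only the three
quantities named above are modelled.

```c
/* Hypothetical stand-ins for the fields named in the text above. */
struct demo_cfs_rq {
	long load_avg;			/* cfs_rq->avg.load_avg */
	long tg_load_avg_contrib;	/* cfs_rq->tg_load_avg_contrib */
};

/*
 * Group weight with the correction applied: replace this cfs_rq's
 * possibly stale contribution to the group sum by its current smoothed
 * load.
 */
static long demo_tg_weight(long tg_load_avg, const struct demo_cfs_rq *cfs_rq)
{
	return tg_load_avg - cfs_rq->tg_load_avg_contrib + cfs_rq->load_avg;
}
```

With a group sum of 1000 that consists only of this cfs_rq's stale
contribution of 1000, and a current load_avg of 1500, the uncorrected sum
(1000) would be smaller than the local load (1500); the corrected value is
1000 - 1000 + 1500 = 1500, which restores \Sum rw_j >= rw_i.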

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

cgroup: Disable IRQs while holding css_set_lock

2016-08-10T10:54:46+00:00

commit 82d6489d0fed2ec8a8c48c19e8d8a04ac8e5bb26 upstream.

While testing the deadline scheduler + cgroup setup I hit this
warning.

[  132.612935] ------------[ cut here ]------------
[  132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
[  132.612952] Modules linked in: (a ton of modules...)
[  132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
[  132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[  132.612982]  0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
[  132.612984]  0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
[  132.612985]  00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
[  132.612986] Call Trace:
[  132.612987]    [] dump_stack+0x63/0x85
[  132.612994]  [] __warn+0xcb/0xf0
[  132.612997]  [] ? push_dl_task.part.32+0x170/0x170
[  132.612999]  [] warn_slowpath_null+0x1d/0x20
[  132.613000]  [] __local_bh_enable_ip+0x6b/0x80
[  132.613008]  [] _raw_write_unlock_bh+0x1a/0x20
[  132.613010]  [] _raw_spin_unlock_bh+0xe/0x10
[  132.613015]  [] put_css_set+0x5c/0x60
[  132.613016]  [] cgroup_free+0x7f/0xa0
[  132.613017]  [] __put_task_struct+0x42/0x140
[  132.613018]  [] dl_task_timer+0xca/0x250
[  132.613027]  [] ? push_dl_task.part.32+0x170/0x170
[  132.613030]  [] __hrtimer_run_queues+0xee/0x270
[  132.613031]  [] hrtimer_interrupt+0xa8/0x190
[  132.613034]  [] local_apic_timer_interrupt+0x38/0x60
[  132.613035]  [] smp_apic_timer_interrupt+0x3d/0x50
[  132.613037]  [] apic_timer_interrupt+0x8c/0xa0
[  132.613038]    [] ? native_safe_halt+0x6/0x10
[  132.613043]  [] default_idle+0x1e/0xd0
[  132.613044]  [] arch_cpu_idle+0xf/0x20
[  132.613046]  [] default_idle_call+0x2a/0x40
[  132.613047]  [] cpu_startup_entry+0x2e7/0x340
[  132.613048]  [] start_secondary+0x155/0x190
[  132.613049] ---[ end trace f91934d162ce9977 ]---

The warning is triggered by spin_(lock|unlock)_bh(&css_set_lock) being
called in interrupt context. Convert spin_lock_bh to spin_lock_irq(save)
to avoid this problem, and other problems of sharing a spinlock with an
interrupt handler.

Cc: Tejun Heo 
Cc: Li Zefan 
Cc: Johannes Weiner 
Cc: Juri Lelli 
Cc: Steven Rostedt 
Cc: cgroups@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Rik van Riel 
Reviewed-by: "Luis Claudio R. Goncalves" 
Signed-off-by: Daniel Bristot de Oliveira 
Acked-by: Zefan Li 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

cgroup: set css->id to -1 during init

2016-08-10T10:54:46+00:00

commit 8fa3b8d689a54d6d04ff7803c724fb7aca6ce98e upstream.

If percpu_ref initialization fails during css_create(), the free path
can end up trying to free css->id of zero.  As ID 0 is unused, it
doesn't cause a critical breakage but it does trigger a warning
message.  Fix it by setting css->id to -1 from init_and_link_css().
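
A userspace sketch of the idea, with hypothetical demo_* names and a toy ID
table in place of the kernel's allocator. It assumes that the remove helper
silently ignores negative IDs and complains about IDs that were never handed
out.

```c
#include <stdbool.h>

#define DEMO_MAX_IDS 8

static bool demo_id_used[DEMO_MAX_IDS];	/* slot 0 is never handed out */
static int demo_bad_frees;		/* counts "warning" events */

/* Hand out the lowest free ID >= 1, or -1 when the table is full. */
static int demo_id_alloc(void)
{
	int i;

	for (i = 1; i < DEMO_MAX_IDS; i++) {
		if (!demo_id_used[i]) {
			demo_id_used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Negative IDs mean "nothing assigned" and are ignored (assumption). */
static void demo_id_remove(int id)
{
	if (id < 0)
		return;
	if (id >= DEMO_MAX_IDS || !demo_id_used[id]) {
		demo_bad_frees++;	/* stands in for the warning */
		return;
	}
	demo_id_used[id] = false;
}

struct demo_css {
	int id;
};

/* Mark the object as having no ID until one is really allocated. */
static void demo_css_init(struct demo_css *css)
{
	css->id = -1;
}

/* Common free path, also reached when creation fails early. */
static void demo_css_free(struct demo_css *css)
{
	demo_id_remove(css->id);
	css->id = -1;
}
```

Freeing a zero-initialized object trips the warning counter, because ID 0 was
never allocated; freeing an object that went through demo_css_init() does not.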

Signed-off-by: Tejun Heo 
Cc: Wenwei Tao 
Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

cgroup: remove redundant cleanup in css_create

2016-08-10T10:54:46+00:00

commit b00c52dae6d9ee8d0f2407118ef6544ae5524781 upstream.

When css creation fails, we remove the css id and exit the percpu_ref before
calling css_free_rcu_fn, but both are done again in css_free_work_fn, so they
are redundant. Removing the css id twice is especially problematic, since
the id may have been assigned to another css after the first removal.

tj: This was broken by two commits updating the free path without
    synchronizing the creation failure path.  This can be easily
    triggered by trying to create more than 64k memory cgroups.

Signed-off-by: Wenwei Tao 
Signed-off-by: Tejun Heo 
Cc: Vladimir Davydov 
Fixes: 9a1049da9bd2 ("percpu-refcount: require percpu_ref to be exited explicitly")
Fixes: 01e586598b22 ("cgroup: release css->id after css_free")
Signed-off-by: Greg Kroah-Hartman

sched/debug: Fix deadlock when enabling sched events

2016-08-10T10:54:44+00:00

commit eda8dca519269c92a0771668b3d5678792de7b78 upstream.

I see a hang when enabling sched events:

  echo 1 > /sys/kernel/debug/tracing/events/sched/enable

The printk buffer shows:

  BUG: spinlock recursion on CPU#1, swapper/1/0
   lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
  ...
  Call Trace:
     [] dump_stack+0x85/0xc2
   [] spin_dump+0x78/0xc0
   [] do_raw_spin_lock+0x11a/0x150
   [] _raw_spin_lock+0x61/0x80
   [] ? try_to_wake_up+0x256/0x4e0
   [] try_to_wake_up+0x256/0x4e0
   [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
   [] wake_up_process+0x15/0x20
   [] insert_work+0x84/0xc0
   [] __queue_work+0x18f/0x660
   [] queue_work_on+0x46/0x90
   [] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
   [] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
   [] soft_cursor+0x1ad/0x230
   [] bit_cursor+0x649/0x680
   [] ? update_attr.isra.2+0x90/0x90
   [] fbcon_cursor+0x14a/0x1c0
   [] hide_cursor+0x28/0x90
   [] vt_console_print+0x3bf/0x3f0
   [] call_console_drivers.constprop.24+0x183/0x200
   [] console_unlock+0x3d4/0x610
   [] vprintk_emit+0x3c5/0x610
   [] vprintk_default+0x29/0x40
   [] printk+0x57/0x73
   [] enqueue_entity+0xc2e/0xc70
   [] enqueue_task_fair+0x59/0xab0
   [] ? kvm_sched_clock_read+0x9/0x20
   [] ? sched_clock+0x9/0x10
   [] activate_task+0x5c/0xa0
   [] ttwu_do_activate+0x54/0xb0
   [] sched_ttwu_pending+0x7a/0xb0
   [] scheduler_ipi+0x61/0x170
   [] smp_trace_reschedule_interrupt+0x4f/0x2a0
   [] trace_reschedule_interrupt+0x96/0xa0
     [] ? native_safe_halt+0x6/0x10
   [] ? trace_hardirqs_on+0xd/0x10
   [] default_idle+0x20/0x1a0
   [] arch_cpu_idle+0xf/0x20
   [] default_idle_call+0x2f/0x50
   [] cpu_startup_entry+0x37e/0x450
   [] start_secondary+0x160/0x1a0

Note the hang only occurs when echoing the above from a physical serial
console, not from an ssh session.

The bug is caused by a deadlock where the task is trying to grab the rq
lock twice because printk()'s aren't safe in sched code.

Signed-off-by: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Matt Fleming 
Cc: Mel Gorman 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Srikar Dronamraju 
Cc: Thomas Gleixner 
Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w

2016-08-10T10:54:43+00:00

commit 57675cb976eff977aefb428e68e4e0236d48a9ff upstream.

Lengthy output of sysrq-w may take a lot of time on a slow serial console.

Currently we reset NMI-watchdog on the current CPU to avoid spurious
lockup messages. Sometimes this doesn't work since softlockup watchdog
might trigger on another CPU which is waiting for an IPI to proceed.
We reset softlockup watchdogs on all CPUs, but we do this only after
listing all tasks, and this may be too late on a busy system.

So reset the watchdogs on all CPUs earlier, inside the for_each_process_thread() loop.

Signed-off-by: Andrey Ryabinin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

tracing: Handle NULL formats in hold_module_trace_bprintk_format()

2016-07-27T15:42:15+00:00

commit 70c8217acd4383e069fe1898bbad36ea4fcdbdcc upstream.

If a task uses a non constant string for the format parameter in
trace_printk(), then the trace_printk_fmt variable is set to NULL. This
variable is then saved in the __trace_printk_fmt section.

The function hold_module_trace_bprintk_format() checks to see if duplicate
formats are used by modules, and reuses them if so (saves them to the list
if it is new). But this function calls lookup_format() that does a strcmp()
to the value (which is now NULL) and can cause a kernel oops.

This wasn't an issue till 3debb0a9ddb ("tracing: Fix trace_printk() to print
when not using bprintk()") which added "__used" to the trace_printk_fmt
variable, and before that, the kernel simply optimized it out (no NULL value
was saved).

The fix is simply to handle the NULL pointer in lookup_format() and have the
caller ignore the value if it was NULL.
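
The shape of such a guard can be sketched in userspace C. The demo_* names and
the fixed-size table are hypothetical, and the return conventions are chosen
for the example rather than taken from the patch.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_FORMATS 8

static const char *demo_formats[DEMO_MAX_FORMATS];
static size_t demo_nr_formats;

/* Return the already stored copy of fmt, or NULL if there is none. */
static const char *demo_lookup_format(const char *fmt)
{
	size_t i;

	if (!fmt)		/* strcmp() on NULL would crash */
		return NULL;

	for (i = 0; i < demo_nr_formats; i++) {
		if (!strcmp(demo_formats[i], fmt))
			return demo_formats[i];
	}
	return NULL;
}

/*
 * Returns 1 if fmt was newly stored, 0 if it was a duplicate or NULL
 * (both are skipped), and -1 if the table is full.
 */
static int demo_hold_format(const char *fmt)
{
	if (!fmt)
		return 0;	/* caller-side check: ignore NULL formats */
	if (demo_lookup_format(fmt))
		return 0;
	if (demo_nr_formats == DEMO_MAX_FORMATS)
		return -1;
	demo_formats[demo_nr_formats++] = fmt;
	return 1;
}
```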

Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com

Reported-by: xingzhen 
Acked-by: Namhyung Kim 
Fixes: 3debb0a9ddb ("tracing: Fix trace_printk() to print when not using bprintk()")
Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Fix cfs_rq avg tracking underflow

2016-07-27T15:42:13+00:00

commit 8974189222159154c55f24ddad33e3613960521a upstream.

As per commit:

  b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

> the code generated from update_cfs_rq_load_avg():
>
> 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
> 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> 		sa->load_avg = max_t(long, sa->load_avg - r, 0);
> 		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
> 		removed_load = 1;
> 	}
>
> turns into:
>
> ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
> ffffffff8108706b:       48 85 c0                test   %rax,%rax
> ffffffff8108706e:       74 40                   je     ffffffff810870b0 
> ffffffff81087070:       4c 89 f8                mov    %r15,%rax
> ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
> ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
> ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
> ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
> ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
> ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
>
> Which you'll note ends up with sa->load_avg -= r in memory at
> ffffffff8108707a.

So I _should_ have looked at other unserialized users of ->load_avg,
but alas. Luckily nikbor reported a similar /0 from task_h_load() which
instantly triggered recollection of this here problem.

Aside from the intermediate value hitting memory and causing problems,
there's another problem: the underflow detection relies on the sign
bit. This reduces the effective width of the variables, IOW it's
effectively the same as having these variables be of signed type.

This patch changes to a different means of unsigned underflow
detection to not rely on the signed bit. This allows the variables to
use the 'full' unsigned range. And it does so with explicit LOAD -
STORE to ensure any intermediate value will never be visible in
memory, allowing these unserialized loads.

Note: GCC generates crap code for this, might warrant a look later.

Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
       maybe we should do clamping on add too.
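
One way to express such a check in userspace C is sketched below. The helper
name is hypothetical, and the volatile accesses are only an approximation of
a single explicit load and a single explicit store; this is not the kernel
implementation.

```c
#include <stdint.h>

/*
 * Subtract val from *ptr, clamping at zero. Underflow is detected by
 * the wrapped result being larger than the original value, so the sign
 * bit is not involved and the full unsigned range stays usable.
 */
static void demo_sub_positive(volatile uint64_t *ptr, uint64_t val)
{
	uint64_t var = *ptr;		/* one explicit load */
	uint64_t res = var - val;	/* may wrap around */

	if (res > var)			/* wrapped, i.e. val > var */
		res = 0;

	*ptr = res;			/* one explicit store of the final value */
}
```

Subtracting 7 from 5 stores 0, and the wrapped intermediate value is never
written back. Subtracting 1 from UINT64_MAX stores UINT64_MAX - 1, a result
that a sign-bit based check would have treated as negative and clamped.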

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrey Ryabinin 
Cc: Chris Wilson 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Yuyang Du 
Cc: bsegall@google.com
Cc: kernel@kyup.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: steve.muckle@linaro.org
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman