linux-stable.git/kernel, branch v4.5.3

perf/core: Fix time tracking bug with multiplexing

2016-05-04T21:49:14+00:00

commit 8fdc65391c6ad16461526a685f03262b3b01bfde upstream.

Stephane reported that commit:

  3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")

introduced a regression wrt. time tracking, as easily observed by:

> This patch introduce a bug in the time tracking of events when
> multiplexing is used.
>
> The issue is easily reproducible with the following perf run:
>
>  $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
>      1.000730239            652,394      branches   (66.41%)
>      1.000730239            597,809      branches   (66.41%)
>      1.000730239            593,870      branches   (66.63%)
>      1.000730239            651,440      branches   (67.03%)
>      1.000730239            656,725      branches   (66.96%)
>      1.000730239            branches
>
> One branches event is shown as not having run. Yet, with
> multiplexing, all events should run especially with a 1s (-I 1000)
> interval. The delta for time_running comes out to 0. Yet, the event
> has run because the kernel is actually multiplexing the events. The
> problem is that the time tracking is the kernel and especially in
> ctx_sched_out() is wrong now.
>
> The problem is that in case that the kernel enters ctx_sched_out() with the
> following state:
>    ctx->is_active=0x7 event_type=0x1
>    Call Trace:
>     [] dump_stack+0x63/0x82
>     [] ctx_sched_out+0x2bc/0x2d0
>     [] perf_mux_hrtimer_handler+0xf6/0x2c0
>     [] ? __perf_install_in_context+0x130/0x130
>     [] __hrtimer_run_queues+0xf8/0x2f0
>     [] hrtimer_interrupt+0xb7/0x1d0
>     [] local_apic_timer_interrupt+0x38/0x60
>     [] smp_apic_timer_interrupt+0x3d/0x50
>     [] apic_timer_interrupt+0x8c/0xa0
>
> In that case, the test:
>       if (is_active & EVENT_TIME)
>
> will be false and the time will not be updated. Time must always be updated on
> sched out.

Fix this by always updating time if EVENT_TIME was set, as opposed to
only updating time when EVENT_TIME changed.

Reported-by: Stephane Eranian 
Tested-by: Stephane Eranian 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Cc: kan.liang@intel.com
Cc: namhyung@kernel.org
Fixes: 3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

perf/core: Don't leak event in the syscall error path

2016-05-04T21:49:14+00:00

commit 201c2f85bd0bc13b712d9c0b3d11251b182e06ae upstream.

In the error path, event_file not being NULL is used to determine
whether the event itself still needs to be free'd, so fix it up to
avoid leaking.

Reported-by: Leon Yu 
Signed-off-by: Alexander Shishkin 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Fixes: 130056275ade ("perf: Do not double free")
Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

locking/mcs: Fix mcs_spin_lock() ordering

2016-05-04T21:49:10+00:00

commit 920c720aa5aa3900a7f1689228fdfc2580a91e7e upstream.

Similar to commit b4b29f94856a ("locking/osq: Fix ordering of node
initialisation in osq_lock") the use of xchg_acquire() is
fundamentally broken with MCS like constructs.

Furthermore, it turns out we rely on the global transitivity of this
operation because the unlock path observes the pointer with a
READ_ONCE(), not an smp_load_acquire().

This is non-critical because the MCS code isn't actually used and
mostly serves as documentation, a stepping stone to the more complex
things we've build on top of the idea.

Reported-by: Andrea Parri 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Fixes: 3552a07a9c4a ("locking/mcs: Use acquire/release semantics")
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback

2016-05-04T21:49:09+00:00

commit 5cf1cacb49aee39c3e02ae87068fc3c6430659b0 upstream.

Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write().  This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function.  This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex.  This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished.  cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem.  We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Signed-off-by: Greg Kroah-Hartman

workqueue: fix ghost PENDING flag while doing MQ IO

2016-05-04T21:49:09+00:00

commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.

The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [] ? bit_wait+0x60/0x60
[  601.351444]  [] schedule+0x35/0x80
[  601.351709]  [] schedule_timeout+0x192/0x230
[  601.351958]  [] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [] ? ktime_get+0x37/0xa0
[  601.352446]  [] ? bit_wait+0x60/0x60
[  601.352688]  [] io_schedule_timeout+0xa4/0x110
[  601.352951]  [] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [] bit_wait_io+0x1b/0x70
[  601.353440]  [] __wait_on_bit+0x5d/0x90
[  601.353689]  [] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [] blkdev_fsync+0x1b/0x50
[  601.355193]  [] vfs_fsync_range+0x49/0xa0
[  601.355432]  [] blkdev_write_iter+0xca/0x100
[  601.355679]  [] __vfs_write+0xaa/0xe0
[  601.355925]  [] vfs_write+0xa9/0x1a0
[  601.356164]  [] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  reqeust A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result request B pended forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier , which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.

Signed-off-by: Roman Pen 
Cc: Gioh Kim 
Cc: Michael Wang 
Cc: Tejun Heo 
Cc: Jens Axboe 
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

futex: Acknowledge a new waiter in counter before plist

2016-05-04T21:49:02+00:00

commit fe1bce9e2107ba3a8faffe572483b6974201a0e6 upstream.

Otherwise an incoming waker on the dest hash bucket can miss
the waiter adding itself to the plist during the lockless
check optimization (small window but still the correct way
of doing this); similarly to the decrement counterpart.

Suggested-by: Peter Zijlstra 
Signed-off-by: Davidlohr Bueso 
Cc: Davidlohr Bueso 
Cc: bigeasy@linutronix.de
Cc: dvhart@infradead.org
Link: http://lkml.kernel.org/r/1461208164-29150-1-git-send-email-dave@stgolabs.net
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

futex: Handle unlock_pi race gracefully

2016-05-04T21:49:02+00:00

commit 89e9e66ba1b3bde9d8ea90566c2aee20697ad681 upstream.

If userspace calls UNLOCK_PI unconditionally without trying the TID -> 0
transition in user space first then the user space value might not have the
waiters bit set. This opens the following race:

CPU0	    	      	    CPU1
uval = get_user(futex)
			    lock(hb)
lock(hb)
			    futex |= FUTEX_WAITERS
			    ....
			    unlock(hb)

cmpxchg(futex, uval, newval)

So the cmpxchg fails and returns -EINVAL to user space, which is wrong because
the futex value is valid.

To handle this (yes, yet another) corner case gracefully, check for a flag
change and retry.

[ tglx: Massaged changelog and slightly reworked implementation ]

Fixes: ccf9e6a80d9e ("futex: Make unlock_pi more robust")
Signed-off-by: Sebastian Andrzej Siewior 
Cc: Davidlohr Bueso 
Cc: Darren Hart 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

sched/cgroup: Fix/cleanup cgroup teardown/init

2016-05-04T21:49:00+00:00

commit 2f5177f0fd7e531b26d54633be62d1d4cb94621c upstream.

The CPU controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit:

  2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")

caused it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Suggested-by: Tejun Heo 
Reported-by: Kazuki Yamaguchi 
Reported-by: Niklas Cassel 
Tested-by: Niklas Cassel 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Tejun Heo 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

bpf: avoid copying junk bytes in bpf_get_current_comm()

2016-04-20T06:45:10+00:00

[ Upstream commit cdc4e47da8f4c32eeb6b2061a8a834f4362a12b7 ]

Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee zero-terminated string.
bpf verifier checks that output buffer is zero-initialized,
so even for short task names the output buffer don't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.

Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: Tobias Waldekranz 
Signed-off-by: Alexei Starovoitov 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

sched/cputime: Fix steal time accounting vs. CPU hotplug

2016-04-12T14:33:49+00:00

commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream.

On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
value over CPU down and up. So after the CPU comes up again the delta
calculation in steal_account_process_tick() wreckages itself due to the
unsigned math:

	 u64 steal = paravirt_steal_clock(smp_processor_id());

	 steal -= this_rq()->prev_steal_time;

So if steal is smaller than rq->prev_steal_time we end up with an insane large
value which then gets added to rq->prev_steal_time, resulting in a permanent
wreckage of the accounting. As a consequence the per CPU stats in /proc/stat
become stale.

Nice trick to tell the world how idle the system is (100%) while the CPU is
100% busy running tasks. Though we prefer realistic numbers.

None of the accounting values which use a previous value to account for
fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
deals with clock warps and limits the /proc/stat visible wreckage. The
prev_time values are still wrong.

Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.

Signed-off-by: Thomas Gleixner 
Acked-by: Rik van Riel 
Cc: Frederic Weisbecker 
Cc: Glauber Costa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Fixes: commit 095c0aa83e52 "sched: adjust scheduler cpu power for stolen time"
Fixes: commit aa483808516c "sched: Remove irq time from available CPU power"
Fixes: commit e6e6685accfa "KVM guest: Steal time accounting"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman