linux-stable.git/kernel, branch v3.16.40

perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race

2017-02-23T03:54:45+00:00

commit 321027c1fe77f892f4ea07846aeae08cefbbb290 upstream.

Di Shen reported a race between two concurrent sys_perf_event_open()
calls where both try and move the same pre-existing software group
into a hardware context.

The problem is exactly that described in commit:

  f63a8daa5812 ("perf: Fix event->ctx locking")

... where, while we wait for a ctx->mutex acquisition, the event->ctx
relation can have changed under us.

That very same commit failed to recognise sys_perf_event_context() as an
external access vector to the events and thereby didn't apply the
established locking rules correctly.

So while one sys_perf_event_open() call is stuck waiting on
mutex_lock_double(), the other (which owns said locks) moves the group
about. So by the time the former sys_perf_event_open() acquires the
locks, the context we've acquired is stale (and possibly dead).

Apply the established locking rules as per perf_event_ctx_lock_nested()
to the mutex_lock_double() for the 'move_group' case. This obviously means
we need to validate state after we acquire the locks.

Reported-by: Di Shen (Keen Lab)
Tested-by: John Dias 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Kees Cook 
Cc: Linus Torvalds 
Cc: Min Chong 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Fixes: f63a8daa5812 ("perf: Fix event->ctx locking")
Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.16:
 - Use ACCESS_ONCE() instead of READ_ONCE()
 - Test perf_event::group_flags instead of group_caps
 - Add the err_locked cleanup block, which we didn't need before
 - Adjust context]
Signed-off-by: Ben Hutchings

perf: Do not double free

2017-02-23T03:54:45+00:00

commit 130056275ade730e7a79c110212c8815202773ee upstream.

In case of: err_file: fput(event_file), we'll end up calling
perf_release() which in turn will free the event.

Do not then free the event _again_.

Tested-by: Alexander Shishkin 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: dvyukov@google.com
Cc: eranian@google.com
Cc: oleg@redhat.com
Cc: panand@redhat.com
Cc: sasha.levin@oracle.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160224174947.697350349@infradead.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings

perf: Fix event->ctx locking

2017-02-23T03:54:44+00:00

commit f63a8daa5812afef4f06c962351687e1ff9ccb2b upstream.

There have been a few reported issues wrt. the lack of locking around
changing event->ctx. This patch tries to address those.

It avoids the whole rwsem thing; and while it appears to work, please
give it some thought in review.

What I did fail at is sensible runtime checks on the use of
event->ctx, the RCU use makes it very hard.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Paul E. McKenney 
Cc: Jiri Olsa 
Cc: Arnaldo Carvalho de Melo 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/20150123125834.209535886@infradead.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings

perf: Fix race in swevent hash

2017-02-23T03:54:44+00:00

commit 12ca6ad2e3a896256f086497a7c7406a547ee373 upstream.

There's a race on CPU unplug where we free the swevent hash array
while it can still have events on. This will result in a
use-after-free which is BAD.

Simply do not free the hash array on unplug. This leaves the thing
around and no use-after-free takes place.

When the last swevent dies, we do a for_each_possible_cpu() iteration
anyway to clean these up, at which time we'll free it, so no leakage
will occur.

Reported-by: Sasha Levin 
Tested-by: Sasha Levin 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Frederic Weisbecker 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings

locking/rtmutex: Prevent dequeue vs. unlock race

2017-02-23T03:54:41+00:00

commit dbb26055defd03d59f678cb5f2c992abe05b064a upstream.

David reported a futex/rtmutex state corruption. It's caused by the
following problem:

CPU0		CPU1		CPU2

l->owner=T1
		rt_mutex_lock(l)
		lock(l->wait_lock)
		l->owner = T1 | HAS_WAITERS;
		enqueue(T2)
		boost()
		  unlock(l->wait_lock)
		schedule()

				rt_mutex_lock(l)
				lock(l->wait_lock)
				l->owner = T1 | HAS_WAITERS;
				enqueue(T3)
				boost()
				  unlock(l->wait_lock)
				schedule()
		signal(->T2)	signal(->T3)
		lock(l->wait_lock)
		dequeue(T2)
		deboost()
		  unlock(l->wait_lock)
				lock(l->wait_lock)
				dequeue(T3)
				  ===> wait list is now empty
				deboost()
				 unlock(l->wait_lock)
		lock(l->wait_lock)
		fixup_rt_mutex_waiters()
		  if (wait_list_empty(l)) {
		    owner = l->owner & ~HAS_WAITERS;
		    l->owner = owner
		     ==> l->owner = T1
		  }

				lock(l->wait_lock)
rt_mutex_unlock(l)		fixup_rt_mutex_waiters()
				  if (wait_list_empty(l)) {
				    owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
 ===> Success (l->owner = NULL)
				    l->owner = owner
				     ==> l->owner = T1
				  }

That means the problem is caused by fixup_rt_mutex_waiters() which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutexes rbtree.

This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation then the previous owner, which just unlocked
the rtmutex is set as the owner again when the write takes place after the
successfull cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.

It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quick, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain that this got not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the
information which allowed to decode this subtle problem.

Reported-by: David Daney 
Tested-by: David Daney 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Steven Rostedt 
Acked-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mark Rutland 
Cc: Peter Zijlstra 
Cc: Sebastian Siewior 
Cc: Will Deacon 
Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.16: use ACCESS_ONCE() instead of {READ,WRITE}_ONCE()]
Signed-off-by: Ben Hutchings

PM / sleep: fix device reference leak in test_suspend

2017-02-23T03:54:27+00:00

commit ceb75787bc75d0a7b88519ab8a68067ac690f55a upstream.

Make sure to drop the reference taken by class_find_device() after
opening the RTC device.

Fixes: 77437fd4e61f (pm: boot time suspend selftest)
Signed-off-by: Johan Hovold 
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Ben Hutchings

x86/panic: replace smp_send_stop() with kdump friendly version in panic path

2017-02-23T03:54:11+00:00

commit 0ee59413c967c35a6dd2dbdab605b4cd42025ee5 upstream.

Daniel Walker reported problems which happens when
crash_kexec_post_notifiers kernel option is enabled
(https://lkml.org/lkml/2015/6/24/44).

In that case, smp_send_stop() is called before entering kdump routines
which assume other CPUs are still online.  As the result, for x86, kdump
routines fail to save other CPUs' registers and disable virtualization
extensions.

To fix this problem, call a new kdump friendly function,
crash_smp_send_stop(), instead of the smp_send_stop() when
crash_kexec_post_notifiers is enabled.  crash_smp_send_stop() is a weak
function, and it just call smp_send_stop().  Architecture codes should
override it so that kdump can work appropriately.  This patch only
provides x86-specific version.

For Xen's PV kernel, just keep the current behavior.

NOTES:

- Right solution would be to place crash_smp_send_stop() before
  __crash_kexec() invocation in all cases and remove smp_send_stop(), but
  we can't do that until all architectures implement own
  crash_smp_send_stop()

- crash_smp_send_stop()-like work is still needed by
  machine_crash_shutdown() because crash_kexec() can be called without
  entering panic()

Fixes: f06e5153f4ae (kernel/panic.c: add "crash_kexec_post_notifiers" option)
Link: http://lkml.kernel.org/r/20160810080948.11028.15344.stgit@sysi4-13.yrl.intra.hitachi.co.jp
Signed-off-by: Hidehiro Kawai 
Reported-by: Daniel Walker 
Cc: Dave Young 
Cc: Baoquan He 
Cc: Vivek Goyal 
Cc: Eric Biederman 
Cc: Masami Hiramatsu 
Cc: Daniel Walker 
Cc: Xunlei Pang 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Borislav Petkov 
Cc: David Vrabel 
Cc: Toshi Kani 
Cc: Ralf Baechle 
Cc: David Daney 
Cc: Aaro Koskinen 
Cc: "Steven J. Hill" 
Cc: Corey Minyard 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

genirq/generic_chip: Add irq_unmap callback

2017-02-23T03:53:57+00:00

commit ee26c013cdee0b947e29d6cadfb9ff3341c69ff9 upstream.

Without this patch irq_domain_disassociate() cannot properly release the
interrupt. In fact, irq_map_generic_chip() checks a bit on 'gc->installed'
but said bit is never cleared, only set.

Commit 088f40b7b027 ("genirq: Generic chip: Add linear irq domain support")
added irq_map_generic_chip() function and also stated "This lacks a removal
function for now".

This commit provides an implementation of an unmap function that can be
called by irq_domain_disassociate().

[ tglx: Made the function static and removed the export as we have neither
  	a prototype nor a modular user. ]

Fixes: 088f40b7b027 ("genirq: Generic chip: Add linear irq domain support")
Signed-off-by: Sebastian Frias 
Cc: Marc Zyngier 
Cc: Mason 
Cc: Jason Cooper 
Link: http://lkml.kernel.org/r/579F5C5A.2070507@laposte.net
Signed-off-by: Thomas Gleixner 
[bwh: Backported to 3.16: open-code irq_domain_get_irq_data(),
 irq_domain_set_info()]
Signed-off-by: Ben Hutchings

tracing: Move mutex to protect against resetting of seq data

2016-11-20T01:17:35+00:00

commit 1245800c0f96eb6ebb368593e251d66c01e61022 upstream.

The iter->seq can be reset outside the protection of the mutex. So can
reading of user data. Move the mutex up to the beginning of the function.

Fixes: d7350c3f45694 ("tracing/core: make the read callbacks reentrants")
Reported-by: Al Viro 
Signed-off-by: Steven Rostedt 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

sched/core: Fix a race between try_to_wake_up() and a woken up task

2016-11-20T01:17:22+00:00

commit 135e8c9250dd5c8c9aae5984fde6f230d0cbfeaf upstream.

The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

	while (p->on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p->pi_lock)
   ..
   if (p->on_rq && ttwu_wakeup())
   ..
   while (p->on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh 
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Benjamin Herrenschmidt 
Cc: Alexey Kardashevskiy 
Cc: Linus Torvalds 
Cc: Nicholas Piggin 
Cc: Nicholas Piggin 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Ben Hutchings