linux-stable.git/kernel/locking, branch v3.16.40

locking/rtmutex: Prevent dequeue vs. unlock race

2017-02-23T03:54:41+00:00

commit dbb26055defd03d59f678cb5f2c992abe05b064a upstream.

David reported a futex/rtmutex state corruption. It's caused by the
following problem:

CPU0		CPU1		CPU2

l->owner=T1
		rt_mutex_lock(l)
		lock(l->wait_lock)
		l->owner = T1 | HAS_WAITERS;
		enqueue(T2)
		boost()
		  unlock(l->wait_lock)
		schedule()

				rt_mutex_lock(l)
				lock(l->wait_lock)
				l->owner = T1 | HAS_WAITERS;
				enqueue(T3)
				boost()
				  unlock(l->wait_lock)
				schedule()
		signal(->T2)	signal(->T3)
		lock(l->wait_lock)
		dequeue(T2)
		deboost()
		  unlock(l->wait_lock)
				lock(l->wait_lock)
				dequeue(T3)
				  ===> wait list is now empty
				deboost()
				 unlock(l->wait_lock)
		lock(l->wait_lock)
		fixup_rt_mutex_waiters()
		  if (wait_list_empty(l)) {
		    owner = l->owner & ~HAS_WAITERS;
		    l->owner = owner
		     ==> l->owner = T1
		  }

				lock(l->wait_lock)
rt_mutex_unlock(l)		fixup_rt_mutex_waiters()
				  if (wait_list_empty(l)) {
				    owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
 ===> Success (l->owner = NULL)
				    l->owner = owner
				     ==> l->owner = T1
				  }

That means the problem is caused by fixup_rt_mutex_waiters(), which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutex's rbtree.

This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation, then the previous owner, which just unlocked
the rtmutex, is set as the owner again when the write takes place after the
successful cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.
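
As an illustration only, the interleaving above can be replayed in a
single-threaded user-space sketch. All toy_* names are hypothetical and are
not kernel code; the RMW is split into its read and write halves so the
fast path unlock can be placed between them, and the wait list is assumed
to be empty throughout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_HAS_WAITERS ((uintptr_t)1)

/* Toy lock: owner holds a task "pointer" with the waiters flag in bit 0. */
struct toy_rtmutex {
	uintptr_t owner;
};

/* Fast path unlock: models cmpxchg(l->owner, task, NULL). */
static bool toy_fastpath_unlock(struct toy_rtmutex *l, uintptr_t task)
{
	if (l->owner != task)
		return false;
	l->owner = 0;
	return true;
}

/* Old write half: stores back even if the waiters bit was never set. */
static void toy_fixup_write_old(struct toy_rtmutex *l, uintptr_t seen)
{
	l->owner = seen & ~TOY_HAS_WAITERS;
}

/* New write half: stores only if the value read had the waiters bit set. */
static void toy_fixup_write_new(struct toy_rtmutex *l, uintptr_t seen)
{
	if (seen & TOY_HAS_WAITERS)
		l->owner = seen & ~TOY_HAS_WAITERS;
}

/* Replays the tail of the trace and returns the final owner value. */
static uintptr_t toy_replay(bool use_fix)
{
	const uintptr_t t1 = 0x1000;		/* stands in for task T1 */
	struct toy_rtmutex l = { .owner = t1 };	/* waiters bit already clear */
	uintptr_t seen;
	bool unlocked;

	seen = l.owner;				/* CPU2: read half of the RMW */
	unlocked = toy_fastpath_unlock(&l, t1);	/* CPU0: cmpxchg succeeds */
	assert(unlocked);

	if (use_fix)				/* CPU2: write half of the RMW */
		toy_fixup_write_new(&l, seen);
	else
		toy_fixup_write_old(&l, seen);

	return l.owner;
}
```

With the old write half, toy_replay(false) leaves T1 installed as owner
of a lock it has already released. With the check in place, toy_replay(true)
leaves the owner field at 0.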

It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quickly, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain why this was not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the
information which made it possible to decode this subtle problem.

Reported-by: David Daney 
Tested-by: David Daney 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Steven Rostedt 
Acked-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mark Rutland 
Cc: Peter Zijlstra 
Cc: Sebastian Siewior 
Cc: Will Deacon 
Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.16: use ACCESS_ONCE() instead of {READ,WRITE}_ONCE()]
Signed-off-by: Ben Hutchings

sched: Handle priority boosted tasks proper in setscheduler()

2015-05-28T09:00:00+00:00

commit 0782e63bc6fe7e2d3408d250df11d388b7799c6b upstream.

Ronny reported that the following scenario is not handled correctly:

	T1 (prio = 10)
	   lock(rtmutex);

	T2 (prio = 20)
	   lock(rtmutex)
	      boost T1

	T1 (prio = 20)
	   sys_set_scheduler(prio = 30)
	   T1 prio = 30
	   ....
	   sys_set_scheduler(prio = 10)
	   T1 prio = 30

The last step is wrong as T1 should now be back at prio 20.

Commit c365c292d059 ("sched: Consider pi boosting in setscheduler()")
only handles the case where a boosted task tries to lower its
priority.

Fix it by taking the new effective priority into account for the
decision whether a change of the priority is required.
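
A rough user-space sketch of that decision, using the numbering of the
scenario above (a larger number means a higher priority, which is the
opposite of the kernel's internal prio values). The toy_* names and the
top_waiter_prio field are hypothetical and only stand in for the PI state:

```c
#include <stdbool.h>

/* Toy task: "prio" is the effective priority the scheduler acts on. */
struct toy_pi_task {
	int normal_prio;	/* priority requested via setscheduler */
	int top_waiter_prio;	/* priority of the top PI waiter, 0 if none */
	int prio;		/* effective priority */
};

/* The effective priority is the higher of the request and the boost. */
static int toy_effective_prio(const struct toy_pi_task *t, int requested)
{
	return requested > t->top_waiter_prio ? requested : t->top_waiter_prio;
}

/*
 * Always store the new normal priority, but only change the effective
 * priority (and report that a requeue is needed) when the new effective
 * priority differs from the current one.
 */
static bool toy_setscheduler(struct toy_pi_task *t, int requested)
{
	int new_effective = toy_effective_prio(t, requested);

	t->normal_prio = requested;
	if (new_effective == t->prio)
		return false;

	t->prio = new_effective;
	return true;
}
```

Starting from T1 boosted to 20, a request for 30 raises the effective
priority to 30, and the following request for 10 brings it back to 20
rather than leaving it at 30.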

Reported-by: Ronny Meeus 
Tested-by: Steven Rostedt 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Steven Rostedt 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Mike Galbraith 
Fixes: c365c292d059 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
Signed-off-by: Ingo Molnar 
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques

locking/rtmutex: Avoid a NULL pointer dereference on deadlock

2015-03-02T15:04:30+00:00

commit 8d1e5a1a1ccf5ae9d8a5a0ee7960202ccb0c5429 upstream.

With task_blocks_on_rt_mutex() returning early -EDEADLK we never
add the waiter to the waitqueue. Later, we try to remove it via
remove_waiter() and go boom in rt_mutex_top_waiter() because
rb_entry() gives a NULL pointer.

( Tested on v3.18-RT where rtmutex is used for regular mutex and I
  tried to get one twice in a row. )

Not sure when this started but I guess 397335f004f4 ("rtmutex: Fix
deadlock detector for real") or commit 3d5c9340d194 ("rtmutex:
Handle deadlock detection smarter").

Signed-off-by: Sebastian Andrzej Siewior 
Acked-by: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1424187823-19600-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar 
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques

locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER

2014-07-16T12:57:13+00:00

Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
encapsulate the dependencies for rwsem optimistic spinning.
No logical changes here as it continues to depend on both
SMP and the XADD algorithm variant.

Signed-off-by: Davidlohr Bueso 
Acked-by: Jason Low 
[ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
Cc: aswin@hp.com
Cc: Chris Mason 
Cc: Davidlohr Bueso 
Cc: Josef Bacik 
Cc: Linus Torvalds 
Cc: Waiman Long 
Signed-off-by: Ingo Molnar 

locking/rwsem: Rename 'activity' to 'count'

2014-07-16T12:56:55+00:00

There are two definitions of struct rw_semaphore, one in linux/rwsem.h
and one in linux/rwsem-spinlock.h.

For some reason they have different names for the initial field. This
makes it impossible to use C99 named initialization for
__RWSEM_INITIALIZER() -- or we have to duplicate that entire thing
along with the structure definitions.

The simpler patch is renaming the rwsem-spinlock variant to match the
regular rwsem.

This allows us to switch to C99 named initialization.
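
To illustrate why the shared field name matters, here is a small sketch
with two hypothetical structures (the toy_* names and their layouts are
made up for this example and do not mirror the kernel headers). Once both
call the field 'count', a single initializer macro using C99 designated
initializers serves both, regardless of field order or type:

```c
/* Two unrelated layouts that agree only on their field names. */
struct toy_rwsem_xadd {
	long count;
	int wait_lock;
};

struct toy_rwsem_spinlock {
	int wait_lock;
	int count;	/* named 'activity' here, the macro below would not fit */
};

/* One initializer for both variants; fields are matched by name. */
#define TOY_RWSEM_INITIALIZER(c)	{ .count = (c), .wait_lock = 0 }

static struct toy_rwsem_xadd toy_sem_a = TOY_RWSEM_INITIALIZER(0);
static struct toy_rwsem_spinlock toy_sem_b = TOY_RWSEM_INITIALIZER(0);
```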

Signed-off-by: Peter Zijlstra 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.org
Signed-off-by: Ingo Molnar

locking/spinlocks/mcs: Micro-optimize osq_unlock()

2014-07-16T11:28:06+00:00

In the unlock function of the cancellable MCS spinlock, the first
thing we do is to retrieve the current CPU's osq node. However, due to
the changes made in the previous patch, in the common case where the
lock is not contended, we wouldn't need to access the current CPU's
osq node anymore.

This patch optimizes this by only retrieving this CPU's osq node
after we attempt the initial cmpxchg to unlock the osq and find
that it is contended.
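
The ordering can be sketched in user space with C11 atomics. This is a
simplified model, not the kernel function: the toy_* names are
hypothetical, the CPU number is passed in explicitly, the lookup counter
exists only to make the effect visible, and the contended path is reduced
to a plain hand-off to an already queued successor:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_UNLOCK_CPUS		4
#define TOY_UNLOCKED_VAL	0

struct toy_unlock_node {
	struct toy_unlock_node *next;
	int locked;
};

static struct toy_unlock_node toy_unlock_nodes[TOY_UNLOCK_CPUS];
static int toy_node_lookups;	/* counts per-cpu node accesses */

/* Stands in for the per-cpu node lookup. */
static struct toy_unlock_node *toy_this_cpu_node(int cpu)
{
	toy_node_lookups++;
	return &toy_unlock_nodes[cpu];
}

/* Returns true if the uncontended fast path was taken. */
static bool toy_osq_unlock(atomic_int *tail, int cpu)
{
	int curr = cpu + 1;	/* encoded CPU number, 0 means unlocked */
	int expected = curr;
	struct toy_unlock_node *node;

	/* Fast path: we are still the tail, so nobody queued behind us. */
	if (atomic_compare_exchange_strong(tail, &expected, TOY_UNLOCKED_VAL))
		return true;

	/* Contended: only now is the per-cpu node needed. */
	node = toy_this_cpu_node(cpu);
	if (node->next)
		node->next->locked = 1;	/* simplified hand-off */
	node->next = NULL;
	return false;
}
```

In the uncontended case the function returns without touching the node
array at all, which is the access this patch avoids.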

Signed-off-by: Jason Low 
Signed-off-by: Peter Zijlstra 
Cc: Scott Norton 
Cc: "Paul E. McKenney" 
Cc: Dave Chinner 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Rik van Riel 
Cc: Andrew Morton 
Cc: "H. Peter Anvin" 
Cc: Steven Rostedt 
Cc: Tim Chen 
Cc: Konrad Rzeszutek Wilk 
Cc: Aswin Chandramouleeswaran 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1405358872-3732-5-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar

locking/spinlocks/mcs: Introduce and use init macro and function for osq locks

2014-07-16T11:28:05+00:00

Currently, we initialize the osq lock by directly setting the lock's values. It
would be preferable to use an init macro to do the initialization like we do
with other locks.

This patch introduces and uses a macro and function for initializing the osq lock.

Signed-off-by: Jason Low 
Signed-off-by: Peter Zijlstra 
Cc: Scott Norton 
Cc: "Paul E. McKenney" 
Cc: Dave Chinner 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Rik van Riel 
Cc: Andrew Morton 
Cc: "H. Peter Anvin" 
Cc: Steven Rostedt 
Cc: Tim Chen 
Cc: Konrad Rzeszutek Wilk 
Cc: Aswin Chandramouleeswaran 
Cc: Linus Torvalds 
Cc: Chris Mason 
Cc: Josef Bacik 
Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar

locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead

2014-07-16T11:28:04+00:00

The cancellable MCS spinlock is currently used to queue threads that are
doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
the lock would access and queue the local node corresponding to the CPU that
it's running on. Currently, the cancellable MCS lock is implemented by using
pointers to these nodes.

In this patch, instead of operating on pointers to the per-cpu nodes, we
store the CPU numbers to which the per-cpu nodes correspond in an atomic_t.
A similar concept is used with the qspinlock.

Operating on the CPU # of the nodes using atomic_t instead of pointers
to those nodes reduces the overhead of the cancellable MCS spinlock
by 32 bits (on 64-bit systems).
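
A minimal user-space sketch of the idea, assuming the encoding is the CPU
number plus one so that 0 can mean "unlocked". The toy_* names and the
fixed-size node array are hypothetical stand-ins for the per-cpu data:

```c
#include <stdatomic.h>

#define TOY_OSQ_NR_CPUS		4
#define TOY_OSQ_UNLOCKED	0

struct toy_spin_node {
	struct toy_spin_node *next, *prev;
	int locked;
};

/* Stand-in for the per-cpu nodes: one node per CPU. */
static struct toy_spin_node toy_percpu_nodes[TOY_OSQ_NR_CPUS];

/* Lock handle as before: the tail is a pointer to a node. */
struct toy_osq_ptr {
	struct toy_spin_node *tail;
};

/* Lock handle afterwards: the tail is an encoded CPU number. */
struct toy_osq_cpu {
	atomic_int tail;
};

/* CPU numbers are shifted by one so that 0 is free to mean "no tail". */
static int toy_encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* Maps an encoded value back to that CPU's node. */
static struct toy_spin_node *toy_decode_cpu(int encoded_cpu_val)
{
	return &toy_percpu_nodes[encoded_cpu_val - 1];
}
```

On a typical LP64 target, sizeof(struct toy_osq_ptr) is 8 and
sizeof(struct toy_osq_cpu) is 4, which is where the 32 bits come from.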

Signed-off-by: Jason Low 
Signed-off-by: Peter Zijlstra 
Cc: Scott Norton 
Cc: "Paul E. McKenney" 
Cc: Dave Chinner 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Rik van Riel 
Cc: Andrew Morton 
Cc: "H. Peter Anvin" 
Cc: Steven Rostedt 
Cc: Tim Chen 
Cc: Konrad Rzeszutek Wilk 
Cc: Aswin Chandramouleeswaran 
Cc: Linus Torvalds 
Cc: Chris Mason 
Cc: Heiko Carstens 
Cc: Josef Bacik 
Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar

locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node()

2014-07-16T11:28:03+00:00

Currently, the per-cpu nodes structure for the cancellable MCS spinlock is
named "optimistic_spin_queue". However, in a follow up patch in the series
we will be introducing a new structure that serves as the new "handle" for
the lock. It would make more sense if that structure is named
"optimistic_spin_queue". Additionally, since the current uses of the
"optimistic_spin_queue" structure are "nodes", it might be better if we
rename them to "node" anyway.

This preparatory patch renames all current "optimistic_spin_queue"
to "optimistic_spin_node".

Signed-off-by: Jason Low 
Signed-off-by: Peter Zijlstra 
Cc: Scott Norton 
Cc: "Paul E. McKenney" 
Cc: Dave Chinner 
Cc: Waiman Long 
Cc: Davidlohr Bueso 
Cc: Rik van Riel 
Cc: Andrew Morton 
Cc: "H. Peter Anvin" 
Cc: Steven Rostedt 
Cc: Tim Chen 
Cc: Konrad Rzeszutek Wilk 
Cc: Aswin Chandramouleeswaran 
Cc: Linus Torvalds 
Cc: Chris Mason 
Cc: Heiko Carstens 
Cc: Josef Bacik 
Link: http://lkml.kernel.org/r/1405358872-3732-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar

locking/rwsem: Allow conservative optimistic spinning when readers have lock

2014-07-16T11:28:02+00:00

Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and using 20x more system CPU time.

Perf profiles indicate that in some workloads significant time can
be spent spinning on !owner. This is because we don't set the lock
owner when reader(s) obtain the rwsem.

In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
return false if there is no lock owner. The rationale is that if we
just entered the slowpath, yet there is no lock owner, then there is
a possibility that a reader has the lock. To be conservative, we'll
avoid spinning in these situations.
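
The resulting policy can be sketched as follows. This is a user-space
model with hypothetical toy_* types; need_resched is passed in as a plain
flag and the locking around the owner read is left out:

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_owner {
	bool on_cpu;	/* is the owning task currently running? */
};

struct toy_spin_rwsem {
	struct toy_owner *owner;	/* NULL when unowned or reader-held */
};

/* Decide whether a writer entering the slowpath should spin. */
static bool toy_can_spin_on_owner(const struct toy_spin_rwsem *sem,
				  bool need_resched)
{
	bool on_cpu = false;	/* conservative default: no owner, no spin */

	if (need_resched)
		return false;

	if (sem->owner)
		on_cpu = sem->owner->on_cpu;

	return on_cpu;
}
```

Spinning is now attempted only when a known owner is running on a CPU.
A NULL owner, which may mean that readers hold the lock, no longer
qualifies.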

This patch reduced the total run time of the xfs_repair workload from
about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
back to close to the same performance as on 3.15.

Retesting of the AIM7 workloads, some of which were used to test the
original optimistic spinning code, confirmed that we still get big
performance gains with optimistic spinning, even with this additional
regression fix. Davidlohr found that while the 'custom' workload took
a throughput hit of ~14% for >300 users with this
additional patch, the overall gain with optimistic spinning is
still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.

Tested-by: Dave Chinner 
Acked-by: Davidlohr Bueso 
Signed-off-by: Jason Low 
Signed-off-by: Peter Zijlstra 
Cc: Tim Chen 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox
Signed-off-by: Ingo Molnar