linux-stable.git/kernel/rcu/tree.h, branch v4.8

rcu: Correctly handle sparse possible cpus

2016-06-15T23:00:05+00:00

In many cases in the RCU tree code, we iterate over the set of cpus for
a leaf node described by rcu_node::grplo and rcu_node::grphi, checking
per-cpu data for each cpu in this range. However, if the set of possible
cpus is sparse, some cpus described in this range are not possible, and
thus no per-cpu region will have been allocated (or initialised) for
them by the generic percpu code.

Erroneous accesses to a per-cpu area for these !possible cpus may fault
or may hit other data depending on the address generated when the
erroneous per-cpu offset is applied. In practice, both cases have been
observed on arm64 hardware (the latter being silent, but detectable with
additional patches).

To avoid issues resulting from this, we must iterate over the set of
*possible* cpus for a given leaf node. This patch adds a new helper,
for_each_leaf_node_possible_cpu, to enable this. As iteration is often
intertwined with rcu_node local bitmask manipulation, a new
leaf_node_cpu_bit helper is added to make this simpler and more
consistent. The RCU tree code is made to use both of these where
appropriate.
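
The two helpers can be modelled in userspace as follows. This is only a
sketch under stated assumptions, not the kernel implementation: possible_mask,
next_possible_cpu(), struct leaf_node and leaf_possible_bits() are
hypothetical stand-ins for cpu_possible_mask, the cpumask iterators and
struct rcu_node, and the sketch assumes grphi < NR_CPUS.

```c
#include <assert.h>

#define NR_CPUS 64

/* Hypothetical stand-in for cpu_possible_mask: bit n set means CPU n is possible. */
static unsigned long long possible_mask;

/* Toy leaf node carrying only the CPU range named in the commit message. */
struct leaf_node {
	int grplo;	/* lowest CPU number covered by this leaf */
	int grphi;	/* highest CPU number covered by this leaf (< NR_CPUS) */
};

/* Return the first possible CPU strictly after 'cpu', or NR_CPUS if none. */
static int next_possible_cpu(int cpu)
{
	for (cpu++; cpu < NR_CPUS; cpu++)
		if (possible_mask & (1ULL << cpu))
			return cpu;
	return NR_CPUS;
}

/* Visit only the possible CPUs within [grplo, grphi]. */
#define for_each_leaf_node_possible_cpu(rnp, cpu) \
	for ((cpu) = next_possible_cpu((rnp)->grplo - 1); \
	     (cpu) <= (rnp)->grphi; \
	     (cpu) = next_possible_cpu(cpu))

/* Bit for 'cpu' in a leaf-local mask, where bit 0 corresponds to grplo. */
static unsigned long leaf_node_cpu_bit(const struct leaf_node *rnp, int cpu)
{
	assert(cpu >= rnp->grplo && cpu <= rnp->grphi);
	return 1UL << (cpu - rnp->grplo);
}

/* Build the leaf-local mask that covers only the possible CPUs. */
static unsigned long leaf_possible_bits(const struct leaf_node *rnp)
{
	unsigned long mask = 0;
	int cpu;

	for_each_leaf_node_possible_cpu(rnp, cpu)
		mask |= leaf_node_cpu_bit(rnp, cpu);
	return mask;
}
```

With a sparse mask in which only cpus 0, 1, 4 and 5 are possible, a leaf
covering cpus 0-7 visits exactly those four cpus, so no per-cpu data is
touched for cpus 2, 3, 6 or 7.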

Without this patch, running reboot at a shell can result in an oops
like:

[ 3369.075979] Unable to handle kernel paging request at virtual address ffffff8008b21b4c
[ 3369.083881] pgd = ffffffc3ecdda000
[ 3369.087270] [ffffff8008b21b4c] *pgd=00000083eca48003, *pud=00000083eca48003, *pmd=0000000000000000
[ 3369.096222] Internal error: Oops: 96000007 [#1] PREEMPT SMP
[ 3369.101781] Modules linked in:
[ 3369.104825] CPU: 2 PID: 1817 Comm: NetworkManager Tainted: G        W       4.6.0+ #3
[ 3369.121239] task: ffffffc0fa13e000 ti: ffffffc3eb940000 task.ti: ffffffc3eb940000
[ 3369.128708] PC is at sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.134094] LR is at sync_rcu_exp_select_cpus+0x104/0x510
[ 3369.139479] pc : [] lr : [] pstate: 200001c5
[ 3369.146860] sp : ffffffc3eb9435a0
[ 3369.150162] x29: ffffffc3eb9435a0 x28: ffffff8008be4f88
[ 3369.155465] x27: ffffff8008b66c80 x26: ffffffc3eceb2600
[ 3369.160767] x25: 0000000000000001 x24: ffffff8008be4f88
[ 3369.166070] x23: ffffff8008b51c3c x22: ffffff8008b66c80
[ 3369.171371] x21: 0000000000000001 x20: ffffff8008b21b40
[ 3369.176673] x19: ffffff8008b66c80 x18: 0000000000000000
[ 3369.181975] x17: 0000007fa951a010 x16: ffffff80086a30f0
[ 3369.187278] x15: 0000007fa9505590 x14: 0000000000000000
[ 3369.192580] x13: ffffff8008b51000 x12: ffffffc3eb940000
[ 3369.197882] x11: 0000000000000006 x10: ffffff8008b51b78
[ 3369.203184] x9 : 0000000000000001 x8 : ffffff8008be4000
[ 3369.208486] x7 : ffffff8008b21b40 x6 : 0000000000001003
[ 3369.213788] x5 : 0000000000000000 x4 : ffffff8008b27280
[ 3369.219090] x3 : ffffff8008b21b4c x2 : 0000000000000001
[ 3369.224406] x1 : 0000000000000001 x0 : 0000000000000140
...
[ 3369.972257] [] sync_rcu_exp_select_cpus+0x188/0x510
[ 3369.978685] [] synchronize_rcu_expedited+0x64/0xa8
[ 3369.985026] [] synchronize_net+0x24/0x30
[ 3369.990499] [] dev_deactivate_many+0x28c/0x298
[ 3369.996493] [] __dev_close_many+0x60/0xd0
[ 3370.002052] [] __dev_close+0x28/0x40
[ 3370.007178] [] __dev_change_flags+0x8c/0x158
[ 3370.012999] [] dev_change_flags+0x20/0x60
[ 3370.018558] [] do_setlink+0x288/0x918
[ 3370.023771] [] rtnl_newlink+0x398/0x6a8
[ 3370.029158] [] rtnetlink_rcv_msg+0xe4/0x220
[ 3370.034891] [] netlink_rcv_skb+0xc4/0xf8
[ 3370.040364] [] rtnetlink_rcv+0x2c/0x40
[ 3370.045663] [] netlink_unicast+0x160/0x238
[ 3370.051309] [] netlink_sendmsg+0x2f0/0x358
[ 3370.056956] [] sock_sendmsg+0x18/0x30
[ 3370.062168] [] ___sys_sendmsg+0x26c/0x280
[ 3370.067728] [] __sys_sendmsg+0x44/0x88
[ 3370.073027] [] SyS_sendmsg+0x10/0x20
[ 3370.078153] [] el0_svc_naked+0x24/0x28

Signed-off-by: Mark Rutland 
Reported-by: Dennis Chen 
Cc: Catalin Marinas 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Mathieu Desnoyers 
Cc: Steve Capper 
Cc: Steven Rostedt 
Cc: Will Deacon 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul E. McKenney

Merge branches 'doc.2016.04.19a', 'exp.2016.03.31d', 'fixes.2016.03.31d' and 'torture.2016.04.21a' into HEAD

2016-04-21T20:48:20+00:00

doc.2016.04.19a: Documentation updates
exp.2016.03.31d: Expedited grace-period updates
fixes.2016.03.31d: Miscellaneous fixes
torture.2016.04.21a: Torture-test updates

rcu: Awaken grace-period kthread if too long since FQS

2016-03-31T20:34:50+00:00

Recent kernels can fail to awaken the grace-period kthread for
quiescent-state forcing.  This commit is a crude hack that does
a wakeup if a scheduling-clock interrupt sees that it has been
too long since force-quiescent-state (FQS) processing.
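
A minimal userspace sketch of such a check is shown here. It is an
illustration only: struct toy_gp_state, kick_deadline, the period argument
and the rate-limiting of repeated wakeups are assumptions of this sketch,
not details taken from the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is later than b" for a free-running tick counter. */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct toy_gp_state {
	unsigned long kick_deadline;	/* hypothetical: tick after which a wakeup is due */
	unsigned int wakeups;		/* number of wakeups issued, for illustration */
};

/* Called from the (modelled) scheduling-clock interrupt. */
static void maybe_kick_gp_kthread(struct toy_gp_state *st, unsigned long now,
				  unsigned long period)
{
	if (!tick_after(now, st->kick_deadline))
		return;				/* FQS is not yet overdue */
	st->wakeups++;				/* stands in for waking the kthread */
	st->kick_deadline = now + period;	/* assumed rate limit on further kicks */
}
```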

Signed-off-by: Paul E. McKenney

rcu: Overlap wakeups with next expedited grace period

2016-03-31T20:34:11+00:00

The current expedited grace-period implementation makes subsequent grace
periods wait on wakeups for the prior grace period.  This does not fit
the dictionary definition of "expedited", so this commit allows these two
phases to overlap.  Doing this requires four waitqueues rather than two
because tasks can now be waiting on the previous, current, and next grace
periods.  The fourth waitqueue makes the bit masking work out nicely.
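
The bit masking can be sketched as follows. This assumes that the low-order
bit of the expedited sequence number marks a grace period in progress, so
that s >> 1 counts grace periods; exp_wq_index() and three_gps_distinct()
are hypothetical names used only for this illustration.

```c
#include <assert.h>

#define EXP_NR_WQ 4	/* four wait queues, as the commit describes */

/*
 * Map an expedited sequence number to a wait-queue index.
 * Assumption: the low-order bit marks "grace period in progress",
 * so s >> 1 counts grace periods.
 */
static unsigned int exp_wq_index(unsigned long s)
{
	/* The mask trick only works for a power-of-two queue count. */
	assert((EXP_NR_WQ & (EXP_NR_WQ - 1)) == 0);
	return (s >> 1) & (EXP_NR_WQ - 1);
}

/* Check that three consecutive grace periods starting at s use distinct queues. */
static int three_gps_distinct(unsigned long s)
{
	unsigned int prev = exp_wq_index(s);
	unsigned int cur = exp_wq_index(s + 2);
	unsigned int next = exp_wq_index(s + 4);

	return prev != cur && cur != next && prev != next;
}
```

Three queues would be enough to keep the previous, current, and next grace
periods apart, but the index would then need a modulo operation instead of a
mask.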

Signed-off-by: Paul E. McKenney

rcu: Enforce expedited-GP fairness via funnel wait queue

2016-03-31T20:34:08+00:00

The current mutex-based funnel-locking approach used by expedited grace
periods is subject to severe unfairness.  The problem arises when a
few tasks, making a path from leaves to root, all wake up before other
tasks do.  A new task can then follow this path all the way to the root,
which needlessly delays tasks whose grace period is done, but who do
not happen to acquire the lock quickly enough.

This commit avoids this problem by maintaining per-rcu_node wait queues,
along with a per-rcu_node counter that tracks the latest grace period
sought by an earlier task to visit this node.  If that grace period
would satisfy the current task, instead of proceeding up the tree,
it waits on the current rcu_node structure using a pair of wait queues
provided for that purpose.  This decouples awakening of old tasks from
the arrival of new tasks.
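
The walk up the tree can be modelled as below. This is a single-threaded
sketch that omits all locking and the actual sleeping; struct toy_node,
funnel_walk() and SEQ_GE() are hypothetical names, and exp_seq_rq here simply
stands for the per-node counter described above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Wraparound-safe "a >= b" for unsigned long sequence numbers. */
#define SEQ_GE(a, b) (ULONG_MAX / 2 >= (unsigned long)((a) - (b)))

struct toy_node {
	struct toy_node *parent;	/* NULL at the root */
	unsigned long exp_seq_rq;	/* latest grace period requested via this node */
};

/*
 * Walk from 'leaf' toward the root on behalf of a task needing grace
 * period 's'.  Returns the node whose wait queue the task should sleep
 * on, or NULL if the task reached the root and must drive the grace
 * period itself.
 */
static struct toy_node *funnel_walk(struct toy_node *leaf, unsigned long s)
{
	struct toy_node *rnp;

	for (rnp = leaf; rnp != NULL; rnp = rnp->parent) {
		if (SEQ_GE(rnp->exp_seq_rq, s))
			return rnp;	/* an earlier task already asked for s or later */
		rnp->exp_seq_rq = s;	/* record our request and keep climbing */
	}
	return NULL;
}
```

In this model, the first task needing a given grace period climbs to the
root, a second task arriving at the same leaf waits at that leaf, and a task
arriving at a sibling leaf waits at the shared parent.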

If the wakeups prove to be a bottleneck, additional kthreads can be
brought to bear for that purpose.

Signed-off-by: Paul E. McKenney

rcu: Shorten expedited_workdone* to exp_workdone*

2016-03-31T20:34:08+00:00

Just a name change to save a few lines and a bit of typing.

Signed-off-by: Paul E. McKenney

rcu: Remove expedited GP funnel-lock bypass

2016-03-31T20:34:07+00:00

Commit #cdacbe1f91264 ("rcu: Add fastpath bypassing funnel locking")
turns out to be a pessimization at high load because it forces a tree
full of tasks to wait for an expedited grace period that they probably
do not need.  This commit therefore removes this optimization.

Signed-off-by: Paul E. McKenney

Merge commit 'fixes.2015.02.23a' into core/rcu

2016-03-15T08:01:06+00:00

 Conflicts:
	kernel/rcu/tree.c

Signed-off-by: Ingo Molnar

rcu: Use simple wait queues where possible in rcutree

2016-02-25T10:27:16+00:00

As of commit dae6e64d2bcfd ("rcu: Introduce proper blocking to no-CBs kthreads
GP waits") the RCU subsystem started making use of wait queues.

Here we convert all additions of RCU wait queues to use simple wait queues,
since they don't need the extra overhead of the full wait queue features.

Originally this was done for RT kernels[1], since we would get things like...

  BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
  in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt
  Pid: 8, comm: rcu_preempt Not tainted
  Call Trace:
   [] __might_sleep+0xd0/0xf0
   [] rt_spin_lock+0x24/0x50
   [] __wake_up+0x36/0x70
   [] rcu_gp_kthread+0x4d2/0x680
   [] ? __init_waitqueue_head+0x50/0x50
   [] ? rcu_gp_fqs+0x80/0x80
   [] kthread+0xdb/0xe0
   [] ? finish_task_switch+0x52/0x100
   [] kernel_thread_helper+0x4/0x10
   [] ? __init_kthread_worker+0x60/0x60
   [] ? gs_change+0xb/0xb

...and hence simple wait queues were deployed on RT out of necessity
(as simple wait uses a raw lock), but mainline might as well take
advantage of the more streamlined support too.

[1] This is a carry forward of work from v3.10-rt; the original conversion
was by Thomas on an earlier -rt version, and Sebastian extended it to
additional post-3.10 added RCU waiters; here I've added a commit log and
unified the RCU changes into one, and uprev'd it to match mainline RCU.

Signed-off-by: Daniel Wagner 
Acked-by: Peter Zijlstra (Intel) 
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng 
Cc: Marcelo Tosatti 
Cc: Steven Rostedt 
Cc: Paul Gortmaker 
Cc: Paolo Bonzini 
Cc: "Paul E. McKenney" 
Link: http://lkml.kernel.org/r/1455871601-27484-6-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner

rcu: Do not call rcu_nocb_gp_cleanup() while holding rnp->lock

2016-02-25T10:27:16+00:00

rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently,
this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will
not enable the IRQs. lockdep is happy.

After switching over to swait this is no longer true: swake_up_all()
enables the IRQs while processing the waiters. __do_softirq() can now
run and will eventually call rcu_process_callbacks(), which wants to
grab rnp->lock.

Let's move the rcu_nocb_gp_cleanup() call outside the lock before we
switch over to swait.
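
The resulting ordering can be modelled in userspace as follows. This is a toy
sketch, not the kernel code: the lock is a plain flag, and struct toy_rnp,
struct toy_wq, toy_gp_cleanup() and toy_gp_end() are hypothetical names. The
point is only that the wait queue is selected under the lock while the
wakeup happens after the lock is dropped.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_wq {
	unsigned int wakeups;		/* counts wakeups, for illustration */
};

struct toy_rnp {
	bool locked;			/* models rnp->lock being held */
	struct toy_wq nocb_gp_wq;	/* hypothetical per-node wait queue */
};

static void toy_lock(struct toy_rnp *rnp)
{
	assert(!rnp->locked);
	rnp->locked = true;
}

static void toy_unlock(struct toy_rnp *rnp)
{
	assert(rnp->locked);
	rnp->locked = false;
}

/* Wake all waiters; in this model it must not run under the node lock. */
static void toy_gp_cleanup(struct toy_rnp *rnp, struct toy_wq *wq)
{
	assert(!rnp->locked);
	wq->wakeups++;
}

static void toy_gp_end(struct toy_rnp *rnp)
{
	struct toy_wq *wq;

	toy_lock(rnp);
	wq = &rnp->nocb_gp_wq;		/* pick the queue while the lock is held */
	toy_unlock(rnp);
	toy_gp_cleanup(rnp, wq);	/* wake waiters only after dropping the lock */
}
```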

If we held rnp->lock and used swait, lockdep would report the
following:

 =================================
 [ INFO: inconsistent lock state ]
 4.2.0-rc5-00025-g9a73ba0 #136 Not tainted
 ---------------------------------
 inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
 rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes:
  (rcu_node_1){+.?...}, at: [] rcu_gp_kthread+0xb97/0xeb0
 {IN-SOFTIRQ-W} state was registered at:
   [] __lock_acquire+0xd5f/0x21e0
   [] lock_acquire+0xdf/0x2b0
   [] _raw_spin_lock_irqsave+0x59/0xa0
   [] rcu_process_callbacks+0x141/0x3c0
   [] __do_softirq+0x14d/0x670
   [] irq_exit+0x104/0x110
   [] smp_apic_timer_interrupt+0x46/0x60
   [] apic_timer_interrupt+0x70/0x80
   [] rq_attach_root+0xa6/0x100
   [] cpu_attach_domain+0x16d/0x650
   [] build_sched_domains+0x942/0xb00
   [] sched_init_smp+0x509/0x5c1
   [] kernel_init_freeable+0x172/0x28f
   [] kernel_init+0xe/0xe0
   [] ret_from_fork+0x3f/0x70
 irq event stamp: 76
 hardirqs last  enabled at (75): [] _raw_spin_unlock_irq+0x30/0x60
 hardirqs last disabled at (76): [] _raw_spin_lock_irq+0x1f/0x90
 softirqs last  enabled at (0): [] copy_process.part.26+0x602/0x1cf0
 softirqs last disabled at (0): [<          (null)>]           (null)
 other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(rcu_node_1);
   <Interrupt>
     lock(rcu_node_1);
  *** DEADLOCK ***
 1 lock held by rcu_preempt/8:
  #0:  (rcu_node_1){+.?...}, at: [] rcu_gp_kthread+0xb97/0xeb0
 stack backtrace:
 CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136
 Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014
  0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0
  0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b
  0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f
 Call Trace:
  [] dump_stack+0x4f/0x7b
  [] print_usage_bug+0x1db/0x1e0
  [] ? save_stack_trace+0x2f/0x50
  [] mark_lock+0x66d/0x6e0
  [] ? check_usage_forwards+0x150/0x150
  [] mark_held_locks+0x78/0xa0
  [] ? _raw_spin_unlock_irq+0x30/0x60
  [] trace_hardirqs_on_caller+0x168/0x220
  [] trace_hardirqs_on+0xd/0x10
  [] _raw_spin_unlock_irq+0x30/0x60
  [] swake_up_all+0xb7/0xe0
  [] rcu_gp_kthread+0xab1/0xeb0
  [] ? trace_hardirqs_on_caller+0xff/0x220
  [] ? _raw_spin_unlock_irq+0x41/0x60
  [] ? rcu_barrier+0x20/0x20
  [] kthread+0x104/0x120
  [] ? _raw_spin_unlock_irq+0x30/0x60
  [] ? kthread_create_on_node+0x260/0x260
  [] ret_from_fork+0x3f/0x70
  [] ? kthread_create_on_node+0x260/0x260

Signed-off-by: Daniel Wagner 
Acked-by: Peter Zijlstra (Intel) 
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng 
Cc: Marcelo Tosatti 
Cc: Steven Rostedt 
Cc: Paul Gortmaker 
Cc: Paolo Bonzini 
Cc: "Paul E. McKenney" 
Link: http://lkml.kernel.org/r/1455871601-27484-5-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner