linux-stable.git/kernel/rcu/tree.c, branch v3.14.78

rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads

2014-11-21T17:23:05+00:00

commit 2aa792e6faf1a00f5accf1f69e87e11a390ba2cd upstream.

The rcu_gp_kthread_wake() function checks for three conditions before
waking up grace period kthreads:

*  Is the thread we are trying to wake up the current thread?
*  Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
*  Is there no thread created for this flavour, hence nothing to wake up?

If any one of these condition is true, we do not call wake_up().
It was found that there are quite a few avoidable wake ups both during
idle time and under stress induced by rcutorture.

Idle:

Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0

rcutorture:

Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0

Here case{1-3} are the cases listed above. We can avoid these wake
ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
period kthreads.

There is a comment about an implied barrier supplied by the wake_up()
logic.  This barrier is necessary for the awakened thread to see the
updated ->gp_flags.  This flag is always being updated with the root node
lock held. Also, the awakened thread tries to acquire the root node lock
before reading ->gp_flags because of which there is proper ordering.

Hence this commit tries to avoid calling wake_up() whenever we can by
using rcu_gp_kthread_wake() function.

Signed-off-by: Pranith Kumar 
CC: Mathieu Desnoyers 
Signed-off-by: Paul E. McKenney 
Cc: Kamal Mostafa 
Signed-off-by: Greg Kroah-Hartman

rcu: Make callers awaken grace-period kthread

2014-11-21T17:23:05+00:00

commit 48a7639ce80cf279834d0d44865e49ecd714f37d upstream.

The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread.  This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.

Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off.  This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.

However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs.  In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress.  Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling.  In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls.  Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.

In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.

This commit therefore makes this change.  The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed.  A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.

Signed-off-by: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Frederic Weisbecker 
Reviewed-by: Josh Triplett 
[ Pranith: backport to 3.13-stable: just rcu_gp_kthread_wake(),
  prereq for 2aa792e "rcu: Use rcu_gp_kthread_wake() to wake up grace
  period kthreads" ]
Signed-off-by: Pranith Kumar 
Signed-off-by: Kamal Mostafa 
Signed-off-by: Greg Kroah-Hartman

Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-01-20T18:25:12+00:00

Pull RCU updates from Ingo Molnar:
 - add RCU torture scripts/tooling
 - static analysis improvements
 - update RCU documentation
 - miscellaneous fixes

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  rcu: Remove "extern" from function declarations in kernel/rcu/rcu.h
  rcu: Remove "extern" from function declarations in include/linux/*rcu*.h
  rcu/torture: Dynamically allocate SRCU output buffer to avoid overflow
  rcu: Don't activate RCU core on NO_HZ_FULL CPUs
  rcu: Warn on allegedly impossible rcu_read_unlock_special() from irq
  rcu: Add an RCU_INITIALIZER for global RCU-protected pointers
  rcu: Make rcu_assign_pointer's assignment volatile and type-safe
  bonding: Use RCU_INIT_POINTER() for better overhead and for sparse
  rcu: Add comment on evaluate-once properties of rcu_assign_pointer().
  rcu: Provide better diagnostics for blocking in RCU callback functions
  rcu: Improve SRCU's grace-period comments
  rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf values
  rcu: Fix coccinelle warnings
  rcutorture: Stop tracking FSF's postal address
  rcutorture: Move checkarg to functions.sh
  rcutorture: Flag errors and warnings with color coding
  rcutorture: Record results from repeated runs of the same test scenario
  rcutorture: Test summary at end of run with less chattiness
  rcutorture: Update comment in kvm.sh listing typical RCU trace events
  rcutorture: Add tracing-enabled version of TREE08
  ...

rcu: Apply smp_mb__after_unlock_lock() to preserve grace periods

2013-12-16T10:36:16+00:00

RCU must ensure that there is the equivalent of a full memory
barrier between any memory access preceding grace period and any
memory access following that same grace period, regardless of
which CPU(s) happen to execute the two memory accesses.
Therefore, downgrading UNLOCK+LOCK to no longer imply a full
memory barrier requires some adjustments to RCU.

This commit therefore adds smp_mb__after_unlock_lock()
invocations as needed after the RCU lock acquisitions that need
to be part of a full-memory-barrier UNLOCK+LOCK.

Signed-off-by: Paul E. McKenney 
Reviewed-by: Peter Zijlstra 
Cc: 
Cc: Linus Torvalds 
Cc: Andrew Morton 
Link: http://lkml.kernel.org/r/1386799151-2219-7-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar

rcu: Don't activate RCU core on NO_HZ_FULL CPUs

2013-12-12T20:34:15+00:00

Whenever a CPU receives a scheduling-clock interrupt, RCU checks to see
if the RCU core needs anything from this CPU.  If so, RCU raises
RCU_SOFTIRQ to carry out any needed processing.

This approach has worked well historically, but it is undesirable on
NO_HZ_FULL CPUs.  Such CPUs are expected to spend almost all of their time
in userspace, so that scheduling-clock interrupts can be disabled while
there is only one runnable task on the CPU in question.  Unfortunately,
raising any softirq has the potential to wake up ksoftirqd, which would
provide the second runnable task on that CPU, preventing disabling of
scheduling-clock interrupts.

What is needed instead is for RCU to leave NO_HZ_FULL CPUs alone,
relying on the grace-period kthreads' quiescent-state forcing to
do any needed RCU work on behalf of those CPUs.

This commit therefore refrains from raising RCU_SOFTIRQ on any
NO_HZ_FULL CPUs during any grace periods that have been in effect
for less than one second.  The one-second limit handles the case
where an inappropriate workload is running on a NO_HZ_FULL CPU
that features lots of scheduling-clock interrupts, but no idle
or userspace time.

Reported-by: Mike Galbraith 
Signed-off-by: Paul E. McKenney 
Tested-by: Mike Galbraith 
Toasted-by: Frederic Weisbecker

rcu: Fix CONFIG_RCU_FANOUT_EXACT for odd fanout/leaf values

2013-12-09T23:12:38+00:00

Each element of the rcu_state structure's ->levelspread[] array
is intended to contain the per-level fanout, where the zero-th
element corresponds to the root of the rcu_node tree, and the last
element corresponds to the leaves.  In the CONFIG_RCU_FANOUT_EXACT
case, this means that the last element should be filled in
from CONFIG_RCU_FANOUT_LEAF (or from the rcu_fanout_leaf boot
parameter, if provided) and that the remaining elements should
be filled in from CONFIG_RCU_FANOUT.  Unfortunately, the current
code in rcu_init_levelspread() takes the opposite approach, placing
CONFIG_RCU_FANOUT_LEAF in the zero-th element and CONFIG_RCU_FANOUT in
the remaining elements.

For typical power-of-two values, this generates odd but functional
rcu_node trees.  However, other values, for example CONFIG_RCU_FANOUT=3
and CONFIG_RCU_FANOUT_LEAF=2, generate trees that can leave some CPUs
out of the grace-period computation, resulting in too-short grace periods
and therefore a broken RCU implementation.

This commit therefore fixes rcu_init_levelspread() to set the last
->levelspread[] array element from CONFIG_RCU_FANOUT_LEAF and the
remaining elements from CONFIG_RCU_FANOUT, thus generating the
intended rcu_node trees.

Signed-off-by: Paul E. McKenney

rcu: Fix coccinelle warnings

2013-12-09T23:12:25+00:00

This commit fixes the following coccinelle warning:

kernel/rcu/tree.c:712:9-10: WARNING: return of 0/1 in function
'rcu_lockdep_current_cpu_online' with return type bool

Return statements in functions returning bool should use
 true/false instead of 1/0.
 Generated by: coccinelle/misc/boolreturn.cocci

Signed-off-by: Fengguang Wu 
Signed-off-by: Paul E. McKenney

rcu: Let the world know when RCU adjusts its geometry

2013-12-03T18:10:19+00:00

Some RCU bugs have been specific to the layout of the rcu_node tree,
but RCU will silently adjust the tree at boot time if appropriate.
This obscures valuable debugging information, so print a message when
this happens.

Signed-off-by: Paul E. McKenney

rcu: Allow task-level idle entry/exit nesting

2013-12-03T18:10:19+00:00

The current task-level idle entry/exit code forces an entry/exit on
each call, regardless of the nesting level.  This commit therefore
properly accounts for nesting.

Signed-off-by: Paul E. McKenney 
Reviewed-by: Frederic Weisbecker

rcu: Break call_rcu() deadlock involving scheduler and perf

2013-12-03T18:10:18+00:00

Dave Jones got the following lockdep splat:

>  ======================================================
>  [ INFO: possible circular locking dependency detected ]
>  3.12.0-rc3+ #92 Not tainted
>  -------------------------------------------------------
>  trinity-child2/15191 is trying to acquire lock:
>   (&rdp->nocb_wq){......}, at: [] __wake_up+0x23/0x50
>
> but task is already holding lock:
>   (&ctx->lock){-.-...}, at: [] perf_event_exit_task+0x109/0x230
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&ctx->lock){-.-...}:
>         [] lock_acquire+0x93/0x200
>         [] _raw_spin_lock+0x40/0x80
>         [] __perf_event_task_sched_out+0x2df/0x5e0
>         [] perf_event_task_sched_out+0x93/0xa0
>         [] __schedule+0x1d2/0xa20
>         [] preempt_schedule_irq+0x50/0xb0
>         [] retint_kernel+0x26/0x30
>         [] tty_flip_buffer_push+0x34/0x50
>         [] pty_write+0x54/0x60
>         [] n_tty_write+0x32d/0x4e0
>         [] tty_write+0x158/0x2d0
>         [] vfs_write+0xc0/0x1f0
>         [] SyS_write+0x4c/0xa0
>         [] tracesys+0xdd/0xe2
>
> -> #2 (&rq->lock){-.-.-.}:
>         [] lock_acquire+0x93/0x200
>         [] _raw_spin_lock+0x40/0x80
>         [] wake_up_new_task+0xc2/0x2e0
>         [] do_fork+0x126/0x460
>         [] kernel_thread+0x26/0x30
>         [] rest_init+0x23/0x140
>         [] start_kernel+0x3f6/0x403
>         [] x86_64_start_reservations+0x2a/0x2c
>         [] x86_64_start_kernel+0xf1/0xf4
>
> -> #1 (&p->pi_lock){-.-.-.}:
>         [] lock_acquire+0x93/0x200
>         [] _raw_spin_lock_irqsave+0x4b/0x90
>         [] try_to_wake_up+0x31/0x350
>         [] default_wake_function+0x12/0x20
>         [] autoremove_wake_function+0x18/0x40
>         [] __wake_up_common+0x58/0x90
>         [] __wake_up+0x39/0x50
>         [] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [] __call_rcu+0x140/0x820
>         [] call_rcu+0x1d/0x20
>         [] cpu_attach_domain+0x287/0x360
>         [] build_sched_domains+0xe5e/0x10a0
>         [] sched_init_smp+0x3b7/0x47a
>         [] kernel_init_freeable+0xf6/0x202
>         [] kernel_init+0xe/0x190
>         [] ret_from_fork+0x7c/0xb0
>
> -> #0 (&rdp->nocb_wq){......}:
>         [] __lock_acquire+0x191a/0x1be0
>         [] lock_acquire+0x93/0x200
>         [] _raw_spin_lock_irqsave+0x4b/0x90
>         [] __wake_up+0x23/0x50
>         [] __call_rcu_nocb_enqueue+0xa8/0xc0
>         [] __call_rcu+0x140/0x820
>         [] kfree_call_rcu+0x20/0x30
>         [] put_ctx+0x4f/0x70
>         [] perf_event_exit_task+0x12e/0x230
>         [] do_exit+0x30d/0xcc0
>         [] do_group_exit+0x4c/0xc0
>         [] SyS_exit_group+0x14/0x20
>         [] tracesys+0xdd/0xe2
>
> other info that might help us debug this:
>
> Chain exists of:
>   &rdp->nocb_wq --> &rq->lock --> &ctx->lock
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&ctx->lock);
>                                 lock(&rq->lock);
>                                 lock(&ctx->lock);
>    lock(&rdp->nocb_wq);
>
>  *** DEADLOCK ***
>
> 1 lock held by trinity-child2/15191:
>  #0:  (&ctx->lock){-.-...}, at: [] perf_event_exit_task+0x109/0x230
>
> stack backtrace:
> CPU: 2 PID: 15191 Comm: trinity-child2 Not tainted 3.12.0-rc3+ #92
>  ffffffff82565b70 ffff880070c2dbf8 ffffffff8172a363 ffffffff824edf40
>  ffff880070c2dc38 ffffffff81726741 ffff880070c2dc90 ffff88022383b1c0
>  ffff88022383aac0 0000000000000000 ffff88022383b188 ffff88022383b1c0
> Call Trace:
>  [] dump_stack+0x4e/0x82
>  [] print_circular_bug+0x200/0x20f
>  [] __lock_acquire+0x191a/0x1be0
>  [] ? get_lock_stats+0x19/0x60
>  [] ? native_sched_clock+0x24/0x80
>  [] lock_acquire+0x93/0x200
>  [] ? __wake_up+0x23/0x50
>  [] _raw_spin_lock_irqsave+0x4b/0x90
>  [] ? __wake_up+0x23/0x50
>  [] __wake_up+0x23/0x50
>  [] __call_rcu_nocb_enqueue+0xa8/0xc0
>  [] __call_rcu+0x140/0x820
>  [] ? local_clock+0x3f/0x50
>  [] kfree_call_rcu+0x20/0x30
>  [] put_ctx+0x4f/0x70
>  [] perf_event_exit_task+0x12e/0x230
>  [] do_exit+0x30d/0xcc0
>  [] ? trace_hardirqs_on_caller+0x115/0x1e0
>  [] ? trace_hardirqs_on+0xd/0x10
>  [] do_group_exit+0x4c/0xc0
>  [] SyS_exit_group+0x14/0x20
>  [] tracesys+0xdd/0xe2

The underlying problem is that perf is invoking call_rcu() with the
scheduler locks held, but in NOCB mode, call_rcu() will with high
probability invoke the scheduler -- which just might want to use its
locks.  The reason that call_rcu() needs to invoke the scheduler is
to wake up the corresponding rcuo callback-offload kthread, which
does the job of starting up a grace period and invoking the callbacks
afterwards.

One solution (championed on a related problem by Lai Jiangshan) is to
simply defer the wakeup to some point where scheduler locks are no longer
held.  Since we don't want to unnecessarily incur the cost of such
deferral, the task before us is threefold:

1.	Determine when it is likely that a relevant scheduler lock is held.

2.	Defer the wakeup in such cases.

3.	Ensure that all deferred wakeups eventually happen, preferably
	sooner rather than later.

We use irqs_disabled_flags() as a proxy for relevant scheduler locks
being held.  This works because the relevant locks are always acquired
with interrupts disabled.  We may defer more often than needed, but that
is at least safe.

The wakeup deferral is tracked via a new field in the per-CPU and
per-RCU-flavor rcu_data structure, namely ->nocb_defer_wakeup.

This flag is checked by the RCU core processing.  The __rcu_pending()
function now checks this flag, which causes rcu_check_callbacks()
to initiate RCU core processing at each scheduling-clock interrupt
where this flag is set.  Of course this is not sufficient because
scheduling-clock interrupts are often turned off (the things we used to
be able to count on!).  So the flags are also checked on entry to any
state that RCU considers to be idle, which includes both NO_HZ_IDLE idle
state and NO_HZ_FULL user-mode-execution state.

This approach should allow call_rcu() to be invoked regardless of what
locks you might be holding, the key word being "should".

Reported-by: Dave Jones 
Signed-off-by: Paul E. McKenney 
Cc: Peter Zijlstra