linux-stable.git/kernel/rcutree.c, branch v3.12.52

rcu: Reject memory-order-induced stall-warning false positives

2015-10-07T07:24:51+00:00

commit 26cdfedf6a902345f8604ea8e0b7dd2566b37a46 upstream.

If a system is idle from an RCU perspective for longer than specified
by CONFIG_RCU_CPU_STALL_TIMEOUT, and if one CPU starts a grace period
just as a second checks for CPU stalls, and if this second CPU happens
to see the old value of rsp->jiffies_stall, it will incorrectly report a
CPU stall.  This is quite rare, but apparently occurs deterministically
on systems with about 6TB of memory.

This commit therefore orders accesses to the data used to determine
whether or not a CPU stall is in progress.  Grace-period initialization
and cleanup first increments rsp->completed to mark the end of the
previous grace period, then records the current jiffies in rsp->gp_start,
then records the jiffies at which a stall can be expected to occur in
rsp->jiffies_stall, and finally increments rsp->gpnum to mark the start
of the new grace period.  Now, this ordering by itself does not prevent
false positives.  For example, if grace-period initialization was delayed
between recording rsp->gp_start and rsp->jiffies_stall, the CPU stall
warning code might still see an old value of rsp->jiffies_stall.

Therefore, this commit also orders the CPU stall warning accesses as
well, loading rsp->gpnum and jiffies, then rsp->jiffies_stall, then
rsp->gp_start, and finally rsp->completed.  This ordering means that
the false-positive scenario in the previous paragraph would result
in rsp->completed being greater than or equal to rsp->gpnum, which is
never valid for a CPU stall, allowing the false positive to be rejected.
Furthermore, any fetch that gets an old value of rsp->jiffies_stall
must also get an old value of rsp->gpnum, which will again be rejected
by the comparison of rsp->gpnum and rsp->completed.  Situations where
rsp->gp_start is later than rsp->jiffies_stall are also rejected, as
are situations where jiffies is less than rsp->jiffies_stall.

Although use of unsynchronized accesses means that there are likely
still some false-positive scenarios (synchronization has proven to be
a very bad idea on large systems), this should get rid of a large class
of these scenarios.

Reported-by: Fabian Herschel 
Reported-by: Michal Hocko 
Signed-off-by: Paul E. McKenney 
Reviewed-by: Michal Hocko 
Tested-by: Jochen Striepe 
Signed-off-by: Jiri Slaby

rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads

2014-11-19T17:38:26+00:00

commit 2aa792e6faf1a00f5accf1f69e87e11a390ba2cd upstream.

The rcu_gp_kthread_wake() function checks for three conditions before
waking up grace period kthreads:

*  Is the thread we are trying to wake up the current thread?
*  Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
*  Is there no thread created for this flavour, hence nothing to wake up?

If any one of these condition is true, we do not call wake_up().
It was found that there are quite a few avoidable wake ups both during
idle time and under stress induced by rcutorture.

Idle:

Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0

rcutorture:

Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0

Here case{1-3} are the cases listed above. We can avoid these wake
ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
period kthreads.

There is a comment about an implied barrier supplied by the wake_up()
logic.  This barrier is necessary for the awakened thread to see the
updated ->gp_flags.  This flag is always being updated with the root node
lock held. Also, the awakened thread tries to acquire the root node lock
before reading ->gp_flags because of which there is proper ordering.

Hence this commit tries to avoid calling wake_up() whenever we can by
using rcu_gp_kthread_wake() function.

Signed-off-by: Pranith Kumar 
CC: Mathieu Desnoyers 
Signed-off-by: Paul E. McKenney 
Cc: Kamal Mostafa 
Signed-off-by: Jiri Slaby

rcu: Make callers awaken grace-period kthread

2014-11-19T17:38:26+00:00

commit 48a7639ce80cf279834d0d44865e49ecd714f37d upstream.

The rcu_start_gp_advanced() function currently uses irq_work_queue()
to defer wakeups of the RCU grace-period kthread.  This deferring
is necessary to avoid RCU-scheduler deadlocks involving the rcu_node
structure's lock, meaning that RCU cannot call any of the scheduler's
wake-up functions while holding one of these locks.

Unfortunately, the second and subsequent calls to irq_work_queue() are
ignored, and the first call will be ignored (aside from queuing the work
item) if the scheduler-clock tick is turned off.  This is OK for many
uses, especially those where irq_work_queue() is called from an interrupt
or softirq handler, because in those cases the scheduler-clock-tick state
will be re-evaluated, which will turn the scheduler-clock tick back on.
On the next tick, any deferred work will then be processed.

However, this strategy does not always work for RCU, which can be invoked
at process level from idle CPUs.  In this case, the tick might never
be turned back on, indefinitely defering a grace-period start request.
Note that the RCU CPU stall detector cannot see this condition, because
there is no RCU grace period in progress.  Therefore, we can (and do!)
see long tens-of-seconds stalls in grace-period handling.  In theory,
we could see a full grace-period hang, but rcutorture testing to date
has seen only the tens-of-seconds stalls.  Event tracing demonstrates
that irq_work_queue() is being called repeatedly to no effect during
these stalls: The "newreq" event appears repeatedly from a task that is
not one of the grace-period kthreads.

In theory, irq_work_queue() might be fixed to avoid this sort of issue,
but RCU's requirements are unusual and it is quite straightforward to pass
wake-up responsibility up through RCU's call chain, so that the wakeup
happens when the offending locks are released.

This commit therefore makes this change.  The rcu_start_gp_advanced(),
rcu_start_future_gp(), rcu_accelerate_cbs(), rcu_advance_cbs(),
__note_gp_changes(), and rcu_start_gp() functions now return a boolean
which indicates when a wake-up is needed.  A new rcu_gp_kthread_wake()
does the wakeup when it is necessary and safe to do so: No self-wakes,
no wake-ups if the ->gp_flags field indicates there is no need (as in
someone else did the wake-up before we got around to it), and no wake-ups
before the grace-period kthread has been created.

Signed-off-by: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Frederic Weisbecker 
Reviewed-by: Josh Triplett 
[ Pranith: backport to 3.13-stable: just rcu_gp_kthread_wake(),
  prereq for 2aa792e "rcu: Use rcu_gp_kthread_wake() to wake up grace
  period kthreads" ]
Signed-off-by: Pranith Kumar 
Signed-off-by: Kamal Mostafa 
Signed-off-by: Jiri Slaby

Merge branches 'doc.2013.08.19a', 'fixes.2013.08.20a', 'sysidle.2013.08.31a' and 'torture.2013.08.20a' into HEAD

2013-08-31T21:44:45+00:00

doc.2013.08.19a: Documentation updates
fixes.2013.08.20a: Miscellaneous fixes
sysidle.2013.08.31a: Detect system-wide idle state.
torture.2013.08.20a: rcutorture updates.

nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU

2013-08-31T21:44:02+00:00

Because RCU's quiescent-state-forcing mechanism is used to drive the
full-system-idle state machine, and because this mechanism is executed
by RCU's grace-period kthreads, this commit forces these kthreads to
run on the timekeeping CPU (tick_do_timer_cpu).  To do otherwise would
mean that the RCU grace-period kthreads would force the system into
non-idle state every time they drove the state machine, which would
be just a bit on the futile side.

Signed-off-by: Paul E. McKenney 
Cc: Frederic Weisbecker 
Cc: Steven Rostedt 
Cc: Lai Jiangshan 
Reviewed-by: Josh Triplett

nohz_full: Add full-system-idle state machine

2013-08-31T21:43:50+00:00

This commit adds the state machine that takes the per-CPU idle data
as input and produces a full-system-idle indication as output.  This
state machine is driven out of RCU's quiescent-state-forcing
mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
idle state and then rcu_sysidle_report() to drive the state machine.

The full-system-idle state is sampled using rcu_sys_is_idle(), which
also drives the state machine if RCU is idle (and does so by forcing
RCU to become non-idle).  This function returns true if all but the
timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
enough to avoid memory contention on the full_sysidle_state state
variable.  The rcu_sysidle_force_exit() may be called externally
to reset the state machine back into non-idle state.

For large systems the state machine is driven out of RCU's
force-quiescent-state logic, which provides good scalability at the price
of millisecond-scale latencies on the transition to full-system-idle
state.  This is not so good for battery-powered systems, which are usually
small enough that they don't need to care about scalability, but which
do care deeply about energy efficiency.  Small systems therefore drive
the state machine directly out of the idle-entry code.  The number of
CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
Kconfig parameter, which defaults to 8.  Note that this is a build-time
definition.

Signed-off-by: Paul E. McKenney 
Cc: Frederic Weisbecker 
Cc: Steven Rostedt 
Cc: Lai Jiangshan 
[ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
Reviewed-by: Josh Triplett 
[ paulmck: Simplify logic and provide better comments for memory barriers,
  based on review comments and questions by Lai Jiangshan. ]

rcu: Simplify _rcu_barrier() processing

2013-08-20T18:45:50+00:00

This commit drops an unneeded ACCESS_ONCE() and simplifies an "our work
is done" check in _rcu_barrier().  This applies feedback from Linus
(https://lkml.org/lkml/2013/7/26/777) that he gave to similar code
in an unrelated patch.

Signed-off-by: Paul E. McKenney 
Reviewed-by: Josh Triplett 
[ paulmck: Fix comment to match code, reported by Lai Jiangshan. ]

nohz_full: Add full-system-idle arguments to API

2013-08-19T01:59:03+00:00

This commit adds an isidle and jiffies argument to force_qs_rnp(),
dyntick_save_progress_counter(), and rcu_implicit_dynticks_qs() to enable
RCU's force-quiescent-state process to check for full-system idle.

Signed-off-by: Paul E. McKenney 
Cc: Frederic Weisbecker 
Cc: Steven Rostedt 
Cc: Lai Jiangshan 
[ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
Reviewed-by: Josh Triplett

nohz_full: Add per-CPU idle-state tracking

2013-08-19T01:58:43+00:00

This commit adds the code that updates the rcu_dyntick structure's
new fields to track the per-CPU idle state based on interrupts and
transitions into and out of the idle loop (NMIs are ignored because NMI
handlers cannot cleanly read out the time anyway).  This code is similar
to the code that maintains RCU's idea of per-CPU idleness, but differs
in that RCU treats CPUs running in user mode as idle, where this new
code does not.

Signed-off-by: Paul E. McKenney 
Acked-by: Frederic Weisbecker 
Cc: Steven Rostedt 
Reviewed-by: Josh Triplett

nohz_full: Add rcu_dyntick data for scalable detection of all-idle state

2013-08-19T01:58:31+00:00

This commit adds fields to the rcu_dyntick structure that are used to
detect idle CPUs.  These new fields differ from the existing ones in
that the existing ones consider a CPU executing in user mode to be idle,
where the new ones consider CPUs executing in user mode to be busy.
The handling of these new fields is otherwise quite similar to that for
the exiting fields.  This commit also adds the initialization required
for these fields.

So, why is usermode execution treated differently, with RCU considering
it a quiescent state equivalent to idle, while in contrast the new
full-system idle state detection considers usermode execution to be
non-idle?

It turns out that although one of RCU's quiescent states is usermode
execution, it is not a full-system idle state.  This is because the
purpose of the full-system idle state is not RCU, but rather determining
when accurate timekeeping can safely be disabled.  Whenever accurate
timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one
CPU must keep the scheduling-clock tick going.  If even one CPU is
executing in user mode, accurate timekeeping is requires, particularly for
architectures where gettimeofday() and friends do not enter the kernel.
Only when all CPUs are really and truly idle can accurate timekeeping be
disabled, allowing all CPUs to turn off the scheduling clock interrupt,
thus greatly improving energy efficiency.

This naturally raises the question "Why is this code in RCU rather than in
timekeeping?", and the answer is that RCU has the data and infrastructure
to efficiently make this determination.

Signed-off-by: Paul E. McKenney 
Acked-by: Frederic Weisbecker 
Cc: Steven Rostedt 
Reviewed-by: Josh Triplett