linux-stable.git/kernel/rcu/tree.h, branch v5.16.2

Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', 'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD

2021-05-18T17:56:19+00:00

bitmaprange.2021.05.10c: Allow "all" for bitmap ranges.
doc.2021.05.10c: Documentation updates.
fixes.2021.05.13a: Miscellaneous fixes.
kvfree_rcu.2021.05.10c: kvfree_rcu() updates.
mmdumpobj.2021.05.10c: mem_dump_obj() updates.
nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading.
srcu.2021.05.12a: SRCU updates.
tasks.2021.05.18a: Tasks-RCU updates.
torture.2021.05.10c: Torture-test updates.

rcu: Fix various typos in comments

2021-05-12T19:11:05+00:00

Fix ~12 single-word typos in RCU code comments.

[ paulmck: Apply feedback from Randy Dunlap. ]
Reviewed-by: Randy Dunlap 
Signed-off-by: Ingo Molnar 
Signed-off-by: Paul E. McKenney

rcu/nocb: Unify timers

2021-05-12T19:10:23+00:00

Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar,
this commit merges them together.  A new RCU_NOCB_WAKE_BYPASS wake level
is introduced.  As a result, timers perform all kinds of deferred wake
ups but other deferred wakeup callsites only handle non-bypass wakeups
in order not to wake up rcuo too early.

The timer also unconditionally executes a full barrier so as to order
timer_pending() and callback enqueue although the path performing
RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also
test against the rdp leader instead of the current rdp.

This unconditional full barrier shouldn't bring visible overhead since
these timers almost never fire.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney

rcu/nocb: Prepare for fine-grained deferred wakeup

2021-05-12T19:10:23+00:00

Tuning the deferred wakeup level must be done from a safe wakeup
point. Currently those sites are:

* ->nocb_timer
* user/idle/guest entry
* CPU down
* softirq/rcuc

All of these sites perform the wake up for both RCU_NOCB_WAKE and
RCU_NOCB_WAKE_FORCE.

In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan
to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until
a timer fires so that we don't wake up the NOCB-gp kthread too early.

To prepare for that, this commit specifies the per-callsite wakeup
level/limit.

Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ]
Signed-off-by: Paul E. McKenney

rcu: Make RCU priority boosting work on single-CPU rcu_node structures

2021-05-10T23:22:54+00:00

When any CPU comes online, it checks to see if an RCU-boost kthread has
already been created for that CPU's leaf rcu_node structure, and if
not, it creates one.  Unfortunately, it also verifies that this leaf
rcu_node structure actually has at least one online CPU, and if not,
it declines to create the kthread.  Although this behavior makes sense
during early boot, especially on systems that claim far more CPUs than
they actually have, it makes no sense for the first CPU to come online
for a given rcu_node structure.  There is no point in checking because
we know there is a CPU on its way in.

The problem is that timing differences can cause this incoming CPU to not
yet be reflected in the various bit masks even at rcutree_online_cpu()
time, and there is no chance at rcutree_prepare_cpu() time.  Plus it
would be better to create the RCU-boost kthread at rcutree_prepare_cpu()
to handle the case where the CPU is involved in an RCU priority inversion
very shortly after it comes online.

This commit therefore moves the checking to rcu_prepare_kthreads(), which
is called only at early boot, when the check is appropriate.  In addition,
it makes rcutree_prepare_cpu() invoke rcu_spawn_one_boost_kthread(), which
no longer does any checking for online CPUs.

With this change, RCU priority boosting tests now pass for short rcutorture
runs, even with single-CPU leaf rcu_node structures.

Cc: Sebastian Andrzej Siewior 
Cc: Scott Wood 
Cc: Thomas Gleixner 
Signed-off-by: Paul E. McKenney

rcu: Add quiescent states and boost states to show_rcu_gp_kthreads() output

2021-05-10T23:22:54+00:00

This commit adds each rcu_node structure's ->qsmask and "bBEG" output
indicating whether: (1) There is a boost kthread, (2) A reader needs
to be (or is in the process of being) boosted, (3) A reader is blocking
an expedited grace period, and (4) A reader is blocking a normal grace
period.  This helps diagnose RCU priority boosting failures.

Signed-off-by: Paul E. McKenney

rcu/nocb: Use the rcuog CPU's ->nocb_timer

2021-05-10T23:02:44+00:00

Currently each CPU has its own ->nocb_timer queued when the nocb_gp
wakeup must be deferred.  This approach has many drawbacks, compared to
a solution based on a single timer per NOCB group:

* There are a lot of timers to maintain.

* The per-rdp ->nocb_lock must be held to queue and cancel the timer
  and this lock can already be heavily contended.

* One timer firing doesn't cancel the other timers in the same group:
  - These other timers can thus cause spurious wakeups
  - Each rdp that queued a timer must lock both ->nocb_lock and then
    ->nocb_gp_lock upon exit from the kernel to idle/user/guest mode.

* We can't cancel all of them if we detect an unflushed bypass in
  nocb_gp_wait(). In fact currently we only ever cancel the ->nocb_timer
  of the leader group.

* The leader group's nocb_timer is cancelled without locking ->nocb_lock
  in nocb_gp_wait().  This currently appears to be safe but is an
  accident waiting to happen.

* Since the timer acquires ->nocb_lock, it requires extra care in the
  NOCB (de-)offloading process, requiring that it be either enabled or
  disabled and then flushed.

This commit instead uses the rcuog kthread's CPU's ->nocb_timer instead.
It is protected by nocb_gp_lock, which is _way_ less contended and
remains so even after this change.  As a matter of fact, the nocb_timer
almost never fires and the deferred wakeup is mostly carried out upon
idle/user/guest entry.  Now the early check performed at this point in
do_nocb_deferred_wakeup() is done on rdp_gp->nocb_defer_wakeup, which
is of course racy.  However, this raciness is harmless because we only
need the guarantee that the timer is queued if we were the last one to
queue it.  Any other situation (another CPU has queued it and we either
see it or not) is fine.

This solves all the issues listed above.

Signed-off-by: Frederic Weisbecker 
Cc: Josh Triplett 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Boqun Feng 
Signed-off-by: Paul E. McKenney

Merge tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-02-21T20:35:04+00:00

Pull scheduler updates from Ingo Molnar:
 "Core scheduler updates:

   - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
     preempt=none/voluntary/full boot options (default: full), to allow
     distros to build a PREEMPT kernel but fall back to close to
     PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
     a boot time selection.

     There's also the /debug/sched_debug switch to do this runtime.

     This feature is implemented via runtime patching (a new variant of
     static calls).

     The scope of the runtime patching can be best reviewed by looking
     at the sched_dynamic_update() function in kernel/sched/core.c.

     ( Note that the dynamic none/voluntary mode isn't 100% identical,
       for example preempt-RCU is available in all cases, plus the
       preempt count is maintained in all models, which has runtime
       overhead even with the code patching. )

     The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
     majority of distributions, are supposed to be unaffected.

   - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
     was found via rcutorture triggering a hang. The bug is that
     rcu_idle_enter() may wake up a NOCB kthread, but this happens after
     the last generic need_resched() check. Some cpuidle drivers fix it
     by chance but many others don't.

     In true 2020 fashion the original bug fix has grown into a 5-patch
     scheduler/RCU fix series plus another 16 RCU patches to address the
     underlying issue of missed preemption events. These are the initial
     fixes that should fix current incarnations of the bug.

   - Clean up rbtree usage in the scheduler, by providing & using the
     following consistent set of rbtree APIs:

       partial-order; less() based:
         - rb_add(): add a new entry to the rbtree
         - rb_add_cached(): like rb_add(), but for a rb_root_cached

       total-order; cmp() based:
         - rb_find(): find an entry in an rbtree
         - rb_find_add(): find an entry, and add if not found

         - rb_find_first(): find the first (leftmost) matching entry
         - rb_next_match(): continue from rb_find_first()
         - rb_for_each(): iterate a sub-tree using the previous two

   - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
     single pass. This is a 4-commit series where each commit improves
     one aspect of the idle sibling scan logic.

   - Improve the cpufreq cooling driver by getting the effective CPU
     utilization metrics from the scheduler

   - Improve the fair scheduler's active load-balancing logic by
     reducing the number of active LB attempts & lengthen the
     load-balancing interval. This improves stress-ng mmapfork
     performance.

   - Fix CFS's estimated utilization (util_est) calculation bug that can
     result in too high utilization values

  Misc updates & fixes:

   - Fix the HRTICK reprogramming & optimization feature

   - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code

   - Reduce dl_add_task_root_domain() overhead

   - Fix uprobes refcount bug

   - Process pending softirqs in flush_smp_call_function_from_idle()

   - Clean up task priority related defines, remove *USER_*PRIO and
     USER_PRIO()

   - Simplify the sched_init_numa() deduplication sort

   - Documentation updates

   - Fix EAS bug in update_misfit_status(), which degraded the quality
     of energy-balancing

   - Smaller cleanups"

* tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  sched,x86: Allow !PREEMPT_DYNAMIC
  entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
  entry: Explicitly flush pending rcuog wakeup before last rescheduling point
  rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
  rcu/nocb: Perform deferred wake up before last idle's need_resched() check
  rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
  sched/features: Distinguish between NORMAL and DEADLINE hrtick
  sched/features: Fix hrtick reprogramming
  sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
  uprobes: (Re)add missing get_uprobe() in __find_uprobe()
  smp: Process pending softirqs in flush_smp_call_function_from_idle()
  sched: Harden PREEMPT_DYNAMIC
  static_call: Allow module use without exposing static_call_key
  sched: Add /debug/sched_preempt
  preempt/dynamic: Support dynamic preempt with preempt= boot option
  preempt/dynamic: Provide irqentry_exit_cond_resched() static call
  preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
  preempt/dynamic: Provide cond_resched() and might_resched() static calls
  preempt: Introduce CONFIG_PREEMPT_DYNAMIC
  static_call: Provide DEFINE_STATIC_CALL_RET0()
  ...

rcu/nocb: Trigger self-IPI on late deferred wake up before user resume

2021-02-17T13:12:43+00:00

Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.

Unfortunately the call to rcu_user_enter() is already past the last
rescheduling opportunity before we resume to userspace or to guest mode.
We may escape there with the woken task ignored.

The ultimate resort to fix every callsites is to trigger a self-IPI
(nohz_full depends on arch to implement arch_irq_work_raise()) that will
trigger a reschedule on IRQ tail or guest exit.

Eventually every site that want a saner treatment will need to carefully
place a call to rcu_nocb_flush_deferred_wakeup() before the last explicit
need_resched() check upon resume.

Fixes: 96d3fd0d315a (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-4-frederic@kernel.org

rcu/nocb: Shutdown nocb timer on de-offloading

2021-01-07T00:24:59+00:00

This commit ensures that the nocb timer is shut down before reaching the
final de-offloaded state.  The key goal is to prevent the timer handler
from manipulating the callbacks without the protection of the nocb locks.

Cc: Josh Triplett 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Lai Jiangshan 
Cc: Joel Fernandes 
Cc: Neeraj Upadhyay 
Cc: Thomas Gleixner 
Inspired-by: Paul E. McKenney 
Tested-by: Boqun Feng 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Paul E. McKenney