linux-stable.git/kernel/sched, branch v4.15

sched/core: Fix cpu.max vs. cpuhotplug deadlock

2018-01-24T09:03:44+00:00

Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

  tg_set_cfs_bandwidth()
    get_online_cpus()
      cpus_read_lock()

    cfs_bandwidth_usage_inc()
      static_key_slow_inc()
        cpus_read_lock()

Reported-by: Tejun Heo 
Tested-by: Tejun Heo 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop
Signed-off-by: Ingo Molnar

delayacct: Account blkio completion on the correct task

2018-01-16T02:29:36+00:00

Before commit:

  e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

delayacct_blkio_end() was called after context-switching into the task which
completed I/O.

This resulted in double counting: the task would account a delay both waiting
for I/O and for time spent in the runqueue.

With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up().
In ttwu, we have not yet context-switched. This is more correct, in that
the delay accounting ends when the I/O is complete.

But delayacct_blkio_end() relies on 'get_current()', and we have not yet
context-switched into the task whose I/O completed. This results in the
wrong task having its delay accounting statistics updated.

Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
so that it can update the statistics of the correct task.

Signed-off-by: Josh Snyder 
Acked-by: Tejun Heo 
Acked-by: Balbir Singh 
Cc: 
Cc: Brendan Gregg 
Cc: Jens Axboe 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-block@vger.kernel.org
Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
Signed-off-by: Ingo Molnar

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2018-01-12T18:23:59+00:00

Pull scheduler fixes from Ingo Molnar:
 "A Kconfig fix, a build fix and a membarrier bug fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  membarrier: Disable preemption when calling smp_call_function_many()
  sched/isolation: Make CONFIG_CPU_ISOLATION=y depend on SMP or COMPILE_TEST
  ia64, sched/cputime: Fix build error if CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y

membarrier: Disable preemption when calling smp_call_function_many()

2018-01-10T07:50:31+00:00

smp_call_function_many() requires disabling preemption around the call.

Signed-off-by: Mathieu Desnoyers 
Cc:  # v4.14+
Cc: Andrea Parri 
Cc: Andrew Hunter 
Cc: Avi Kivity 
Cc: Benjamin Herrenschmidt 
Cc: Boqun Feng 
Cc: Dave Watson 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Maged Michael 
Cc: Michael Ellerman 
Cc: Paul E . McKenney 
Cc: Paul E. McKenney 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar

locking/lockdep: Remove cross-release leftovers

2018-01-08T16:30:45+00:00

There's two cross-release leftover facilities:

 - the crossrelease_hist_*() irq-tracing callbacks (NOPs currently)
 - the complete_release_commit() callback (NOP as well)

Remove them.

Cc: David Sterba 
Cc: Byungchul Park 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

cpufreq: schedutil: Use idle_calls counter of the remote CPU

2017-12-28T11:26:54+00:00

Since the recent remote cpufreq callback work, its possible that a cpufreq
update is triggered from a remote CPU. For single policies however, the current
code uses the local CPU when trying to determine if the remote sg_cpu entered
idle or is busy. This is incorrect. To remedy this, compare with the nohz tick
idle_calls counter of the remote CPU.

Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
Acked-by: Viresh Kumar 
Acked-by: Peter Zijlstra (Intel) 
Signed-off-by: Joel Fernandes 
Cc: 4.14+  # 4.14+
Signed-off-by: Rafael J. Wysocki

sched/rt: Do not pull from current CPU if only one CPU to pull

2017-12-15T15:28:02+00:00

Daniel Wagner reported a crash on the BeagleBone Black SoC.

This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.

As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.

Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.

Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.

Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.

Reported-by: Daniel Wagner 
Signed-off-by: Steven Rostedt (VMware) 
Acked-by: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Sebastian Andrzej Siewior 
Cc: Thomas Gleixner 
Cc: linux-rt-users 
Cc: stable@vger.kernel.org
Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
Signed-off-by: Ingo Molnar

sched/core: Fix kernel-doc warnings after code movement

2017-12-11T15:10:42+00:00

Fix the following kernel-doc warnings after code restructuring:

  ../kernel/sched/core.c:5113: warning: No description found for parameter 't'
  ../kernel/sched/core.c:5113: warning: Excess function parameter 'interval' description in 'sched_rr_get_interval'

	get rid of set_fs()")

Signed-off-by: Randy Dunlap 
Cc: Al Viro 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: abca5fc535a3e ("sched_rr_get_interval(): move compat to native,
Link: http://lkml.kernel.org/r/995c6ded-b32e-bbe4-d9f5-4d42d121aff1@infradead.org
Signed-off-by: Ingo Molnar

sched/fair: Update and fix the runnable propagation rule

2017-12-06T18:30:50+00:00

Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.

Instead, we can estimate what should be the new runnable of the prev
cfs_rq and check that this estimation stay in a possible range. The
prop_runnable_sum is a good estimation when adding runnable_sum but
fails most often when we remove it. Instead, we could use the formula
below instead:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

which assumes that tasks are equally runnable which is not true but
easy to compute.

Beside these estimates, we have several simple rules that help us to filter
out wrong ones:

 - ge->avg.runnable_sum <= than LOAD_AVG_MAX
 - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
 - ge->avg.runnable_sum can't increase when we detach a task

The effect of these fixes is better cgroups balancing.

Signed-off-by: Vincent Guittot 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Peter Zijlstra (Intel) 
Cc: Ben Segall 
Cc: Chris Mason 
Cc: Dietmar Eggemann 
Cc: Josef Bacik 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Cc: Yuyang Du 
Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar

sched/wait: Fix add_wait_queue() behavioral change

2017-12-06T18:30:34+00:00

The following cleanup commit:

  50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries")

... unintentionally changed the behavior of add_wait_queue() from
inserting the wait entry at the head of the wait queue to the tail
of the wait queue.

Beyond a negative performance impact this change in behavior
theoretically also breaks wait queues which mix exclusive and
non-exclusive waiters, as non-exclusive waiters will not be
woken up if they are queued behind enough exclusive waiters.

Signed-off-by: Omar Sandoval 
Reviewed-by: Jens Axboe 
Acked-by: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Thomas Gleixner 
Cc: kernel-team@fb.com
Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
Signed-off-by: Ingo Molnar