linux.git/kernel/sched, branch v4.0-rc2

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2015-02-21T18:40:02+00:00

Pull scheduler fixes from Ingo Molnar:
 "Thiscontains misc fixes: preempt_schedule_common() and io_schedule()
  recursion fixes, sched/dl fixes, a completion_done() revert, two
  sched/rt fixes and a comment update patch"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt: Avoid obvious configuration fail
  sched/autogroup: Fix failure to set cpu.rt_runtime_us
  sched/dl: Do update_rq_clock() in yield_task_dl()
  sched: Prevent recursion in io_schedule()
  sched/completion: Serialize completion_done() with complete()
  sched: Fix preempt_schedule_common() triggering tracing recursion
  sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()
  sched: Make dl_task_time() use task_rq_lock()
  sched: Clarify ordering between task_rq_lock() and move_queued_task()

sched/rt: Avoid obvious configuration fail

2015-02-18T15:17:23+00:00

Setting the root group's cpu.rt_runtime_us to 0 is a bad thing; it
would disallow the kernel creating RT tasks.

One can of course still set it to 1, which will (likely) still wreck
your kernel, but at least make it clear that setting it to 0 is not
good.

Collect both sanity checks into the one place while we're there.

Suggested-by: Zefan Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/20150209112715.GO24151@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

sched/autogroup: Fix failure to set cpu.rt_runtime_us

2015-02-18T15:17:20+00:00

Because task_group() uses a cache of autogroup_task_group(), whose
output depends on sched_class, switching classes can generate
problems.

In particular, when started as fair, the cache points to the
autogroup, so when switching to RT the tg_rt_schedulable() test fails
for every cpu.rt_{runtime,period}_us change because now the autogroup
has tasks and no runtime.

Furthermore, going back to the previous semantics of varying
task_group() with sched_class has the down-side that the sched_debug
output varies as well, even though the task really is in the
autogroup.

Therefore add an autogroup exception to tg_has_rt_tasks() -- such that
both (all) task_group() usages in sched/core now have one. And remove
all the remnants of the variable task_group() output.

Reported-by: Zefan Li 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Stefan Bader 
Fixes: 8323f26ce342 ("sched: Fix race in task_group()")
Link: http://lkml.kernel.org/r/20150209112237.GR5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

sched/dl: Do update_rq_clock() in yield_task_dl()

2015-02-18T15:17:12+00:00

update_curr_dl() needs actual rq clock.

Signed-off-by: Kirill Tkhai 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1423040972.18770.10.camel@tkhai
Signed-off-by: Ingo Molnar

sched: Prevent recursion in io_schedule()

2015-02-18T13:27:44+00:00

io_schedule() calls blk_flush_plug() which, depending on the
contents of current->plug, can initiate arbitrary blk-io requests.

Note that this contrasts with blk_schedule_flush_plug() which requires
all non-trivial work to be handed off to a separate thread.

This makes it possible for io_schedule() to recurse, and initiating
block requests could possibly call mempool_alloc() which, in times of
memory pressure, uses io_schedule().

Apart from any stack usage issues, io_schedule() will not behave
correctly when called recursively as delayacct_blkio_start() does
not allow for repeated calls.

So:
 - use ->in_iowait to detect recursion.  Set it earlier, and restore
   it to the old value.
 - move the call to "raw_rq" after the call to blk_flush_plug().
   As this is some sort of per-cpu thing, we want some chance that
   we are on the right CPU
 - When io_schedule() is called recurively, use blk_schedule_flush_plug()
   which cannot further recurse.
 - as this makes io_schedule() a lot more complex and as io_schedule()
   must match io_schedule_timeout(), but all the changes in io_schedule_timeout()
   and make io_schedule a simple wrapper for that.

Signed-off-by: NeilBrown 
Signed-off-by: Peter Zijlstra (Intel) 
[ Moved the now rudimentary io_schedule() into sched.h. ]
Cc: Jens Axboe 
Cc: Linus Torvalds 
Cc: Tony Battersby 
Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brown
Signed-off-by: Ingo Molnar

sched/completion: Serialize completion_done() with complete()

2015-02-18T13:27:40+00:00

Commit de30ec47302c "Remove unnecessary ->wait.lock serialization when
reading completion state" was not correct, without lock/unlock the code
like stop_machine_from_inactive_cpu()

	while (!completion_done())
		cpu_relax();

can return before complete() finishes its spin_unlock() which writes to
this memory. And spin_unlock_wait().

While at it, change try_wait_for_completion() to use READ_ONCE().

Reported-by: Paul E. McKenney 
Reported-by: Davidlohr Bueso 
Tested-by: Paul E. McKenney 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Peter Zijlstra (Intel) 
[ Added a comment with the barrier. ]
Cc: Linus Torvalds 
Cc: Nicholas Mc Guire 
Cc: raghavendra.kt@linux.vnet.ibm.com
Cc: waiman.long@hp.com
Fixes: de30ec47302c ("sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state")
Link: http://lkml.kernel.org/r/20150212195913.GA30430@redhat.com
Signed-off-by: Ingo Molnar

sched: Fix preempt_schedule_common() triggering tracing recursion

2015-02-18T13:27:38+00:00

Since the function graph tracer needs to disable preemption, it might
call preempt_schedule() after reenabling  it if something triggered the
need for rescheduling in between.

Therefore we can't trace preempt_schedule() itself because we would
face a function tracing recursion otherwise as the tracer is always
called before PREEMPT_ACTIVE gets set to prevent that recursion. This is
why preempt_schedule() is tagged as "notrace".

But the same issue applies to every function called by preempt_schedule()
before PREEMPT_ACTIVE is actually set. And preempt_schedule_common() is
one such example. Unfortunately we forgot to tag it as notrace as well
and as a result we are encountering tracing recursion since it got
introduced by:

   a18b5d0181923 ("sched: Fix missing preemption opportunity")

Let's fix that by applying the appropriate function tag to
preempt_schedule_common().

Reported-by: Huang Ying 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Steven Rostedt 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1424110807-15057-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar

sched/dl: Prevent enqueue of a sleeping task in dl_task_timer()

2015-02-18T13:27:31+00:00

A deadline task may be throttled and dequeued at the same time.
This happens, when it becomes throttled in schedule(), which
is called to go to sleep:

current->state = TASK_INTERRUPTIBLE;
schedule()
    deactivate_task()
        dequeue_task_dl()
            update_curr_dl()
                start_dl_timer()
            __dequeue_task_dl()
    prev->on_rq = 0;

Later the timer fires, but the task is still dequeued:

dl_task_timer()
    enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */

Someone wakes it up:

try_to_wake_up()

    enqueue_dl_entity()
        BUG_ON(on_dl_rq())

Patch fixes this problem, it prevents queueing !on_rq tasks
on dl_rq.

Reported-by: Fengguang Wu 
Signed-off-by: Kirill Tkhai 
Signed-off-by: Peter Zijlstra (Intel) 
[ Wrote comment. ]
Cc: Juri Lelli 
Fixes: 1019a359d3dc ("sched/deadline: Fix stale yield state")
Link: http://lkml.kernel.org/r/1374601424090314@web4j.yandex.ru
Signed-off-by: Ingo Molnar

sched: Make dl_task_time() use task_rq_lock()

2015-02-18T13:27:30+00:00

Kirill reported that a dl task can be throttled and dequeued at the
same time. This happens, when it becomes throttled in schedule(),
which is called to go to sleep:

current->state = TASK_INTERRUPTIBLE;
schedule()
    deactivate_task()
        dequeue_task_dl()
            update_curr_dl()
                start_dl_timer()
            __dequeue_task_dl()
    prev->on_rq = 0;

This invalidates the assumption from commit 0f397f2c90ce ("sched/dl:
Fix race in dl_task_timer()"):

  "The only reason we don't strictly need ->pi_lock now is because
   we're guaranteed to have p->state == TASK_RUNNING here and are
   thus free of ttwu races".

And therefore we have to use the full task_rq_lock() here.

This further amends the fact that we forgot to update the rq lock loop
for TASK_ON_RQ_MIGRATE, from commit cca26e8009d1 ("sched: Teach
scheduler to understand TASK_ON_RQ_MIGRATING state").

Reported-by: Kirill Tkhai 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Juri Lelli 
Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

sched: Clarify ordering between task_rq_lock() and move_queued_task()

2015-02-18T13:27:28+00:00

There was a wee bit of confusion around the exact ordering here;
clarify things.

Reported-by: Kirill Tkhai 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Paul E. McKenney 
Link: http://lkml.kernel.org/r/20150217121258.GM5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar