linux-stable.git/kernel/sched.c, branch linux-2.6.37.y

ftrace: Fix memory leak with function graph and cpu hotplug

2011-03-23T19:50:02+00:00

commit 868baf07b1a259f5f3803c1dc2777b6c358f83cf upstream.

When the function graph tracer starts, it needs to make a special
stack for each task to save the real return values of the tasks.
All running tasks have this stack created, as well as any new
tasks.

On CPU hot plug, the new idle task will allocate a stack as well
when init_idle() is called. The problem is that cpu hotplug does
not create a new idle_task. Instead it uses the idle task that
existed when the cpu went down.

ftrace_graph_init_task() will add a new ret_stack to the task
that is given to it. Because a clone gives the task its
parent's ret_stack pointer, the function does not check whether
the task's ret_stack is already set. When the CPU hotplug code
starts a CPU up again, init_idle() will therefore allocate a new
stack for the idle task even though one already existed for it.

The solution is to treat the idle_task specially. In fact, the
function_graph code already does, just not at init_idle().
Instead of using ftrace_graph_init_task() for the idle task,
which expects the task to be a clone, have a separate
ftrace_graph_init_idle_task(). Also, we will create a per_cpu
ret_stack that is used by the idle task. When we call
ftrace_graph_init_idle_task(), it will check whether the idle
task's ret_stack is NULL and, if it is, assign it the per_cpu
ret_stack.
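
A minimal userspace sketch of the idea described above, not the
kernel code itself. The names model_init_idle_task(), struct
model_idle_task, NR_CPUS_MODEL and RET_STACK_DEPTH are hypothetical,
and calloc() stands in for the kernel allocator. It only shows that
a stack is allocated once per CPU and then reused when the same
idle task is initialised again:

```c
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS_MODEL   4    /* assumption: small fixed CPU count */
#define RET_STACK_DEPTH 50   /* assumption: entries per return stack */

struct ret_entry {
	unsigned long ret;           /* saved real return address */
};

struct model_idle_task {
	struct ret_entry *ret_stack; /* NULL until a stack is assigned */
};

/* One slot per CPU; stands in for the per_cpu ret_stack. */
static struct ret_entry *idle_ret_stack[NR_CPUS_MODEL];
static int allocations;          /* counts allocations so a leak would show */

/* Hypothetical model of ftrace_graph_init_idle_task(). */
static void model_init_idle_task(struct model_idle_task *idle, int cpu)
{
	if (idle->ret_stack)
		return;              /* CPU came back up: keep the old stack */

	if (!idle_ret_stack[cpu]) {
		idle_ret_stack[cpu] = calloc(RET_STACK_DEPTH,
					     sizeof(struct ret_entry));
		if (!idle_ret_stack[cpu])
			return;      /* allocation failed: leave it unset */
		allocations++;
	}
	idle->ret_stack = idle_ret_stack[cpu];
}
```

Calling model_init_idle_task() twice for the same idle task and CPU,
as happens across a hotplug down/up cycle, leaves the allocation
count at one.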

Reported-by: Benjamin Herrenschmidt 
Suggested-by: Peter Zijlstra 
Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman

sched, cgroup: Use exit hook to avoid use-after-free crash

2011-02-17T23:14:40+00:00

commit 068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267 upstream.

By not notifying the controller of the on-exit move back to
init_css_set, we fail to move the task out of the previous
cgroup's cfs_rq. This leads to an opportunity for a
cgroup-destroy to come in and free the cgroup (there are no
active tasks left in it after all) to which the not-quite dead
task is still enqueued.

Reported-by: Miklos Vajna 
Fixed-by: Mike Galbraith 
Signed-off-by: Peter Zijlstra 
Cc: Mike Galbraith 
Signed-off-by: Ingo Molnar 
LKML-Reference: <1293206353.29444.205.camel@laptop>
Signed-off-by: Greg Kroah-Hartman

sched: Change wait_for_completion_*_timeout() to return a signed long

2011-02-17T23:14:39+00:00

commit 6bf4123760a5aece6e4829ce90b70b6ffd751d65 upstream.

wait_for_completion_*_timeout() can return:

   0: if the wait timed out
 -ve: if the wait was interrupted
 +ve: if the completion was completed.

As they currently return an 'unsigned long', the last two cases
are not easily distinguished which can easily result in buggy
code, as is the case for the recently added
wait_for_completion_interruptible_timeout() call in
net/sunrpc/cache.c.

So change them both to return 'long'.  As MAX_SCHEDULE_TIMEOUT
is LONG_MAX, a large +ve return value should never overflow.
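
To illustrate why the return type matters, here is a small
standalone sketch. The enum and the two classify helpers are
hypothetical and exist only for this example; the value -512 is used
as an illustrative negative error code. With a signed type the three
cases are distinct, while the same bits seen as 'unsigned long' look
like a very large positive value:

```c
#include <limits.h>

/* Hypothetical labels for the three outcomes listed above. */
enum wait_result { WAIT_TIMED_OUT, WAIT_INTERRUPTED, WAIT_COMPLETED };

/* With a signed return value all three cases can be told apart. */
static enum wait_result classify_signed(long ret)
{
	if (ret == 0)
		return WAIT_TIMED_OUT;
	if (ret < 0)
		return WAIT_INTERRUPTED;
	return WAIT_COMPLETED;
}

/*
 * With an unsigned return value a negative error code has already
 * wrapped to a huge positive number, so "interrupted" is
 * indistinguishable from "completed".
 */
static enum wait_result classify_unsigned(unsigned long ret)
{
	if (ret == 0)
		return WAIT_TIMED_OUT;
	return WAIT_COMPLETED;
}
```

Passing -512 to classify_signed() yields WAIT_INTERRUPTED, whereas
the same value converted to 'unsigned long' is reported as
WAIT_COMPLETED by classify_unsigned().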

Signed-off-by: NeilBrown 
Cc: Peter Zijlstra 
Cc: J. Bruce Fields 
Cc: Andrew Morton 
Cc: Linus Torvalds 
LKML-Reference: <20110105125016.64ccab0e@notabene.brown>
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched: Remove debugging check

2010-12-19T22:24:27+00:00

Linus reported that the new warning introduced by commit f26f9aff6aaf
"Sched: fix skip_clock_update optimization" triggers. The need_resched
flag can be set by other CPUs asynchronously so this debug check is
bogus - remove it.

Reported-by: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Mike Galbraith 
LKML-Reference: 
Signed-off-by: Ingo Molnar

sched: Fix the irqtime code for 32bit

2010-12-16T10:17:47+00:00

Since the irqtime accounting is using non-atomic u64 and can be read
from remote cpus (writes are strictly cpu local, reads are not) we
have to deal with observing partial updates.

When we do observe partial updates the clock movement (in particular,
->clock_task movement) will go funny (in either direction); a
subsequent clock update (observing the full update) will make it go
funny in the opposite direction.

Since we rely on these clocks to be strictly monotonic we cannot
suffer backwards motion. One possible solution would be to simply
ignore all backwards deltas, but that will lead to accounting
artefacts, most notably: clock_task + irq_time != clock. This
inaccuracy would end up in user visible stats.

Therefore serialize the reads using a seqcount.
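
The general sequence-counter pattern can be sketched in userspace
with C11 atomics. This is an illustration of the technique, not the
kernel's seqcount API: struct model_irqtime, model_irqtime_add() and
model_irqtime_read() are hypothetical names, the 64-bit value is
deliberately split into two 32-bit halves to mimic a 32-bit machine,
and a single writer is assumed (writes are strictly cpu local):

```c
#include <stdatomic.h>
#include <stdint.h>

struct model_irqtime {
	atomic_uint seq;          /* odd while an update is in progress */
	_Atomic uint32_t lo;      /* low half of the 64-bit counter */
	_Atomic uint32_t hi;      /* high half of the 64-bit counter */
};

/* Writer side: only ever called by the owning CPU (assumption). */
static void model_irqtime_add(struct model_irqtime *t, uint64_t delta)
{
	unsigned int s = atomic_load_explicit(&t->seq, memory_order_relaxed);
	uint64_t v;

	atomic_store_explicit(&t->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	v = ((uint64_t)atomic_load_explicit(&t->hi, memory_order_relaxed) << 32) |
	    atomic_load_explicit(&t->lo, memory_order_relaxed);
	v += delta;
	atomic_store_explicit(&t->lo, (uint32_t)v, memory_order_relaxed);
	atomic_store_explicit(&t->hi, (uint32_t)(v >> 32), memory_order_relaxed);

	atomic_store_explicit(&t->seq, s + 2, memory_order_release);
}

/* Reader side: retry until both halves come from one complete update. */
static uint64_t model_irqtime_read(struct model_irqtime *t)
{
	unsigned int s1, s2;
	uint32_t lo, hi;

	do {
		s1 = atomic_load_explicit(&t->seq, memory_order_acquire);
		lo = atomic_load_explicit(&t->lo, memory_order_relaxed);
		hi = atomic_load_explicit(&t->hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&t->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return ((uint64_t)hi << 32) | lo;
}
```

A reader that races with an update sees either an odd sequence
number or a changed one and retries, so it never combines the old
high half with the new low half.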

Reviewed-by: Venkatesh Pallipadi 
Reported-by: Mikael Pettersson 
Tested-by: Mikael Pettersson 
Signed-off-by: Peter Zijlstra 
LKML-Reference: <1292242434.6803.200.camel@twins>
Signed-off-by: Ingo Molnar

sched: Fix the irqtime code to deal with u64 wraps

2010-12-16T10:17:46+00:00

Some ARM systems have a short sched_clock() [ which needs to be fixed
too ], but this exposed a bug in the irq_time code as well: it doesn't
deal with wraps at all.

Fix the irq_time code to deal with u64 wraps by re-writing the code to
only use delta increments, which avoids the whole issue.
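
Why delta increments sidestep the wrap problem can be shown with a
short standalone sketch. struct model_irq_acct and model_account()
are hypothetical names for this example only. Unsigned subtraction
is modular, so 'now - last' gives the elapsed amount even when 'now'
has wrapped past zero:

```c
#include <stdint.h>

struct model_irq_acct {
	uint64_t last;       /* clock value seen at the previous sample */
	uint64_t irq_time;   /* accumulated time, built from deltas only */
};

/* Account the time elapsed since the previous sample. */
static void model_account(struct model_irq_acct *a, uint64_t now)
{
	uint64_t delta = now - a->last;  /* modular: correct across a wrap */

	a->last = now;
	a->irq_time += delta;
}
```

Comparing absolute clock values would see time jump backwards at the
wrap; accumulating deltas never compares absolute values at all.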

Reviewed-by: Venkatesh Pallipadi 
Reported-by: Mikael Pettersson 
Tested-by: Mikael Pettersson 
Signed-off-by: Peter Zijlstra 
LKML-Reference: <1292242433.6803.199.camel@twins>
Signed-off-by: Ingo Molnar

Sched: fix skip_clock_update optimization

2010-12-08T19:15:06+00:00

idle_balance() drops/retakes rq->lock, leaving the previous task
vulnerable to set_tsk_need_resched().  Clear it after we return
from balancing instead, and in setup_thread_stack() as well, so
no successfully descheduled or never scheduled task has it set.

Need resched confused the skip_clock_update logic, which assumes
that the next call to update_rq_clock() will come nearly immediately
after being set.  Make the optimization robust against the case of
waking a sleeper before it successfully deschedules by checking that
the current task has not been dequeued before setting the flag,
since it is that useless clock update we're trying to save, and
clear unconditionally in schedule() proper instead of conditionally
in put_prev_task().

Signed-off-by: Mike Galbraith 
Reported-by: Bjoern B. Brandenburg 
Tested-by: Yong Zhang 
Signed-off-by: Peter Zijlstra 
Cc: stable@kernel.org
LKML-Reference: <1291802742.1417.9.camel@marge.simson.net>
Signed-off-by: Ingo Molnar

sched: Cure more NO_HZ load average woes

2010-12-08T19:15:04+00:00

There's a long-running regression that proved difficult to fix and
which is hitting certain people and is rather annoying in its effects.

Damien reported that after 74f5187ac8 (sched: Cure load average vs
NO_HZ woes) his load average is unnaturally high. He also noted that
even with that patch reverted the load average numbers are not
correct.

The problem is that the previous patch only solved half the NO_HZ
problem: it addressed the part of going into NO_HZ mode, not of
coming out of NO_HZ mode. This patch implements that missing half.

When coming out of NO_HZ mode there are two important things to take
care of:

 - Folding the pending idle delta into the global active count.
 - Correctly aging the averages for the idle-duration.

So with this patch the NO_HZ interaction should be complete and
behaviour between CONFIG_NO_HZ=[yn] should be equivalent.

Furthermore, this patch slightly changes the load average computation
by adding a rounding term to the fixed point multiplication.
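
The effect of a rounding term on a fixed point multiplication can be
sketched as follows. The constants FSHIFT, FIXED_1 and EXP_1 follow
the kernel's load average definitions; the two function names are
hypothetical, and the exact form used by this patch is an assumption
of the example:

```c
#define FSHIFT   11                  /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)     /* 1.0 in fixed point */
#define EXP_1    1884UL              /* 1/exp(5sec/1min) in fixed point */

/* Exponential decay step that truncates the fractional part. */
static unsigned long model_calc_load_trunc(unsigned long load,
					   unsigned long exp,
					   unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

/* Same step, but adding half an LSB first so the shift rounds. */
static unsigned long model_calc_load_round(unsigned long load,
					   unsigned long exp,
					   unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	load += 1UL << (FSHIFT - 1);     /* rounding term */
	return load >> FSHIFT;
}
```

With load = 1 and active = 0 the truncating variant drops straight
to 0 because 1884 >> 11 is 0, while the rounding variant returns 1
because (1884 + 1024) >> 11 is 1. In the steady state where load and
active are both FIXED_1, both variants return FIXED_1.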

Reported-by: Damien Wyart 
Reported-by: Tim McGrath 
Tested-by: Damien Wyart 
Tested-by: Orion Poplawski 
Tested-by: Kyle McMartin 
Signed-off-by: Peter Zijlstra 
Cc: stable@kernel.org
Cc: Chase Douglas 
LKML-Reference: <1291129145.32004.874.camel@laptop>
Signed-off-by: Ingo Molnar

sched: Fix cross-sched-class wakeup preemption

2010-11-11T13:37:23+00:00

Instead of dealing with sched classes inside each check_preempt_curr()
implementation, pull out this logic into the generic wakeup preemption
path.

This fixes a hang in KVM (and others) where we are waiting for the
stop machine thread to run ...
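
A rough model of pulling the class comparison into a generic path is
sketched below. Everything in it is hypothetical: the rank ordering,
struct model_task, and the per-class policy model_same_class_preempt()
are assumptions made for illustration and do not reproduce the
kernel's data structures:

```c
/* Rank of a scheduling class; a lower value is a higher class. */
enum model_class_rank { RANK_STOP, RANK_RT, RANK_FAIR, RANK_IDLE };

struct model_task {
	enum model_class_rank rank;
	int prio;                /* within-class priority, lower wins */
};

/* Stand-in for a class-specific check_preempt_curr() implementation. */
static int model_same_class_preempt(const struct model_task *curr,
				    const struct model_task *p)
{
	return p->prio < curr->prio;
}

/*
 * Generic wakeup preemption: classes only decide among their own
 * tasks, and a task from a higher class always preempts a lower one.
 * Returns 1 if the waking task p should preempt curr.
 */
static int model_wakeup_should_preempt(const struct model_task *curr,
				       const struct model_task *p)
{
	if (p->rank == curr->rank)
		return model_same_class_preempt(curr, p);
	return p->rank < curr->rank;
}
```

In this model a waking stop-class task always preempts a running
fair-class task, regardless of what the fair class's own check would
have decided.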

Reported-by: Markus Trippelsdorf 
Tested-by: Marcelo Tosatti 
Tested-by: Sergey Senozhatsky 
Signed-off-by: Peter Zijlstra 
LKML-Reference: <1288891946.2039.31.camel@laptop>
Signed-off-by: Ingo Molnar

sched: Use group weight, idle cpu metrics to fix imbalances during idle

2010-11-10T22:13:56+00:00

Currently we consider a sched domain to be well balanced when the imbalance
is less than the domain's imbalance_pct. As the number of cores and threads
is increasing, current values of imbalance_pct (for example 25% for a
NUMA domain) are not enough to detect imbalances like:

a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
socket, leading to an idle HT cpu.

b) On a hypothetical 2 socket NHM-EX system (each socket having 8 cores and
16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
socket and 7 on another socket, leaving one core in a socket idle
whereas in another socket we have a core having both its HT siblings busy.

While this issue can be fixed by decreasing the domain's imbalance_pct
(by making it a function of number of logical cpus in the domain), it
can potentially cause more task migrations across sched groups in an
overloaded case.

Fix this by using imbalance_pct only during newly_idle and busy
load balancing. During idle load balancing, instead check whether
there is an imbalance in the number of idle cpus between the busiest
group and this sched_group, or whether the busiest group has more
tasks than its weight, which the idle cpu in this_group can pull.
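
The two checks can be compared with a small worked sketch. The
struct and function names are hypothetical, task counts are used as
a stand-in for load (which assumes equally weighted tasks), 125
represents a 25% imbalance_pct, and the "more than one idle cpu of
difference" threshold is an assumption of this example:

```c
struct model_group {
	unsigned long nr_running;  /* runnable tasks in the group */
	unsigned long weight;      /* number of logical cpus in the group */
	unsigned long idle_cpus;   /* idle logical cpus in the group */
};

/* Percentage based check: imbalanced only above imbalance_pct. */
static int model_pct_imbalanced(unsigned long busiest_load,
				unsigned long this_load,
				unsigned long imbalance_pct)
{
	return 100 * busiest_load > imbalance_pct * this_load;
}

/* Idle balancing check based on group weight and idle cpu counts. */
static int model_idle_imbalanced(const struct model_group *busiest,
				 const struct model_group *this_grp)
{
	if (busiest->nr_running > busiest->weight)
		return 1;          /* more tasks than cpus: one can be pulled */
	return this_grp->idle_cpus > busiest->idle_cpus + 1;
}
```

For case a) above, 100 * 13 = 1300 is not greater than
125 * 11 = 1375, so the percentage check calls the 13/11 split
balanced, while the weight check sees 13 tasks on 12 logical cpus
and reports an imbalance. For case b), the idle cpu counts of 7 and
9 differ by more than one, so the idle check also reports an
imbalance.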

Reported-by: Nikhil Rao 
Signed-off-by: Suresh Siddha 
Signed-off-by: Peter Zijlstra 
LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar