linux-stable.git/kernel/sched.c, branch linux-2.6.35.y

ftrace: Fix memory leak with function graph and cpu hotplug

2011-03-31T18:58:36+00:00

commit 868baf07b1a259f5f3803c1dc2777b6c358f83cf upstream.

When the fuction graph tracer starts, it needs to make a special
stack for each task to save the real return values of the tasks.
All running tasks have this stack created, as well as any new
tasks.

On CPU hot plug, the new idle task will allocate a stack as well
when init_idle() is called. The problem is that cpu hotplug does
not create a new idle_task. Instead it uses the idle task that
existed when the cpu went down.

ftrace_graph_init_task() will add a new ret_stack to the task
that is given to it. Because a clone will make the task
have a stack of its parent it does not check if the task's
ret_stack is already NULL or not. When the CPU hotplug code
starts a CPU up again, it will allocate a new stack even
though one already existed for it.

The solution is to treat the idle_task specially. In fact, the
function_graph code already does, just not at init_idle().
Instead of using the ftrace_graph_init_task() for the idle task,
which that function expects the task to be a clone, have a
separate ftrace_graph_init_idle_task(). Also, we will create a
per_cpu ret_stack that is used by the idle task. When we call
ftrace_graph_init_idle_task() it will check if the idle task's
ret_stack is NULL, if it is, then it will assign it the per_cpu
ret_stack.

Reported-by: Benjamin Herrenschmidt 
Suggested-by: Peter Zijlstra 
Signed-off-by: Steven Rostedt 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Andi Kleen

sched: Use group weight, idle cpu metrics to fix imbalances during idle

2011-03-31T18:58:02+00:00

Commit: aae6d3ddd8b90f5b2c8d79a2b914d1706d124193 upstream

Currently we consider a sched domain to be well balanced when the imbalance
is less than the domain's imablance_pct. As the number of cores and threads
are increasing, current values of imbalance_pct (for example 25% for a
NUMA domain) are not enough to detect imbalances like:

a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
socket. Leading to an idle HT cpu.

b) On a hypothetial 2 socket NHM-EX system (each socket having 8 cores and
16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
socket and 7 on another socket. Leaving one core in a socket idle
whereas in another socket we have a core having both its HT siblings busy.

While this issue can be fixed by decreasing the domain's imbalance_pct
(by making it a function of number of logical cpus in the domain), it
can potentially cause more task migrations across sched groups in an
overloaded case.

Fix this by using imbalance_pct only during newly_idle and busy
load balancing. And during idle load balancing, check if there
is an imbalance in number of idle cpu's across the busiest and this
sched_group or if the busiest group has more tasks than its weight that
the idle cpu in this_group can pull.

Reported-by: Nikhil Rao 
Signed-off-by: Suresh Siddha 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched, cgroup: Fixup broken cgroup movement

2011-03-31T18:58:02+00:00

Commit: b2b5ce022acf5e9f52f7b78c5579994fdde191d4 upstream

Dima noticed that we fail to correct the ->vruntime of sleeping tasks
when we move them between cgroups.

Reported-by: Dima Zavin 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
Tested-by: Mike Galbraith 
LKML-Reference: <1287150604.29097.1513.camel@twins>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Export account_system_vtime()

2011-03-31T18:58:01+00:00

Commit: b7dadc38797584f6203386da1947ed5edf516646 upstream

KVM uses it for example:

 ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!

Cc: Venkatesh Pallipadi 
Cc: Peter Zijlstra 
LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Andi Kleen

sched: Call tick_check_idle before __irq_enter

2011-03-31T18:58:01+00:00

Commit: d267f87fb8179c6dba03d08b91952e81bc3723c7 upstream

When CPU is idle and on first interrupt, irq_enter calls tick_check_idle()
to notify interruption from idle. But, there is a problem if this call
is done after __irq_enter, as all routines in __irq_enter may find
stale time due to yet to be done tick_check_idle.

Specifically, trace calls in __irq_enter when they use global clock and also
account_system_vtime change in this patch as it wants to use sched_clock_cpu()
to do proper irq timing.

But, tick_check_idle was moved after __irq_enter intentionally to
prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a:

    irq: call __irq_enter() before calling the tick_idle_check
    Impact: avoid spurious ksoftirqd wakeups

Moving tick_check_idle() before __irq_enter and wrapping it with
local_bh_enable/disable would solve both the problems.

Fixed-by: Yong Zhang 
Signed-off-by: Venkatesh Pallipadi 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Remove irq time from available CPU power

2011-03-31T18:58:01+00:00

Commit: aa483808516ca5cacfa0e5849691f64fec25828e upstream

The idea was suggested by Peter Zijlstra here:

  http://marc.info/?l=linux-kernel&m=127476934517534&w=2

irq time is technically not available to the tasks running on the CPU.
This patch removes irq time from CPU power piggybacking on
sched_rt_avg_update().

Tested this by keeping CPU X busy with a network intensive task having 75%
oa a single CPU irq processing (hard+soft) on a 4-way system. And start seven
cycle soakers on the system. Without this change, there will be two tasks on
each CPU. With this change, there is a single task on irq busy CPU X and
remaining 7 tasks are spread around among other 3 CPUs.

Signed-off-by: Venkatesh Pallipadi 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Do not account irq time to current task

2011-03-31T18:58:01+00:00

Commit: 305e6835e05513406fa12820e40e4a8ecb63743c upstream

Scheduler accounts both softirq and interrupt processing times to the
currently running task. This means, if the interrupt processing was
for some other task in the system, then the current task ends up being
penalized as it gets shorter runtime than otherwise.

Change sched task accounting to acoount only actual task time from
currently running task. Now update_curr(), modifies the delta_exec to
depend on rq->clock_task.

Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
for later.

This change will impact scheduling behavior in interrupt heavy conditions.

Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
spending 75%+ of its time in irq processing. CPU 3 spending around 35%
time running nc task.

Now, if I run another CPU intensive task on CPU 2, without this change
/proc//schedstat shows 100% of time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task as remaining
time is actually spent on irq processing.

Signed-off-by: Venkatesh Pallipadi 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time

2011-03-31T18:58:00+00:00

Commit: b52bfee445d315549d41eacf2fa7c156e7d153d5 upstream

s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
the fine granularity accounting of user, system, hardirq, softirq times.
Adding that option on archs like x86 will be challenging however, given the
state of TSC reliability on various platforms and also the overhead it will
add in syscall entry exit.

Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer tick
based samples). This accounting is added with a new config option
CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
interested in paying the perf penalty.

This accounting is based on sched_clock, with the code being generic.
So, other archs may find it useful as well.

This patch just adds the core logic and does not enable this logic yet.

Signed-off-by: Venkatesh Pallipadi 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Fix softirq time accounting

2011-03-31T18:57:59+00:00

Commit: 75e1056f5c57050415b64cb761a3acc35d91f013 upstream

Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread:

   http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

The problem is, softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to differentiate between whether softirq is
being processed or is it just that bh has been disabled. So, a hardirq when bh
is disabled results in time being wrongly accounted as softirq.

Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well. As account_system_time() in normal tick based accouting also uses
softirq_count, which will be set even when not in softirq with bh disabled.

Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
processing. The patch below does that and adds API in_serving_softirq() which
returns whether we are currently processing softirq or not.

Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.

Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.

Signed-off-by: Venkatesh Pallipadi 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Do not consider SCHED_IDLE tasks to be cache hot

2011-03-31T18:57:58+00:00

Commit: ef8002f6848236de5adc613063ebeabddea8a6fb upstream

This patch adds a check in task_hot to return if the task has SCHED_IDLE
policy. SCHED_IDLE tasks have very low weight, and when run with regular
workloads, are typically scheduled many milliseconds apart. There is no
need to consider these tasks hot for load balancing.

Signed-off-by: Nikhil Rao 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman