linux-stable.git/include/linux/sched.h, branch linux-2.6.35.y

sched: Use group weight, idle cpu metrics to fix imbalances during idle

2011-03-31T18:58:02+00:00

Commit: aae6d3ddd8b90f5b2c8d79a2b914d1706d124193 upstream

Currently we consider a sched domain to be well balanced when the imbalance
is less than the domain's imbalance_pct. As the number of cores and threads
increases, current values of imbalance_pct (for example 25% for a
NUMA domain) are not enough to detect imbalances like:

a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
socket, leading to an idle HT cpu.

b) On a hypothetical 2 socket NHM-EX system (each socket having 8 cores and
16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
socket and 7 on another socket, leaving one core in a socket idle
whereas in another socket we have a core having both its HT siblings busy.

While this issue can be fixed by decreasing the domain's imbalance_pct
(by making it a function of number of logical cpus in the domain), it
can potentially cause more task migrations across sched groups in an
overloaded case.

Fix this by using imbalance_pct only during newly_idle and busy
load balancing. During idle load balancing, instead check whether there
is an imbalance in the number of idle cpus between the busiest group and
this sched_group, or whether the busiest group has more tasks than its
weight, which the idle cpu in this_group can then pull.
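
The idle-time check described above can be sketched in userspace C as
follows. This is only an illustration: the struct, its field names, the
function name and the one-cpu slack in the idle comparison are assumptions,
not the code from the patch.

```c
#include <stdbool.h>

/* Hypothetical per-group summary; field names are illustrative only. */
struct group_stats {
	unsigned int nr_running;   /* runnable tasks in the group */
	unsigned int idle_cpus;    /* logical cpus currently idle */
	unsigned int weight;       /* logical cpus in the group */
};

/*
 * Idle-time balance test: report an imbalance when this group has more
 * than one extra idle cpu compared to the busiest group (assumed slack),
 * or when the busiest group runs more tasks than it has logical cpus.
 */
static bool idle_imbalance(const struct group_stats *this_grp,
			   const struct group_stats *busiest)
{
	bool idle_cpus_even = this_grp->idle_cpus <= busiest->idle_cpus + 1;
	bool busiest_fits = busiest->nr_running <= busiest->weight;

	return !(idle_cpus_even && busiest_fits);
}
```

Under these assumptions, case a) (13 tasks on a 12-thread socket) trips the
weight test and case b) (9 idle cpus against 7) trips the idle-cpu test,
while an even 12/12 split is reported as balanced.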

Reported-by: Nikhil Rao 
Signed-off-by: Suresh Siddha 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched, cgroup: Fixup broken cgroup movement

2011-03-31T18:58:02+00:00

Commit: b2b5ce022acf5e9f52f7b78c5579994fdde191d4 upstream

Dima noticed that we fail to correct the ->vruntime of sleeping tasks
when we move them between cgroups.
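
One way to picture the correction is to keep a sleeping task's ->vruntime
relative to its queue across the move. The sketch below assumes that
approach; the types and the function name are hypothetical stand-ins and do
not reproduce the patch itself.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a per-group run queue and a sched entity. */
struct group_queue {
	uint64_t min_vruntime;     /* smallest vruntime tracked by the queue */
};

struct entity_sketch {
	uint64_t vruntime;         /* absolute vruntime of the task */
};

/*
 * Re-base a sleeping entity: subtract the old queue's min_vruntime to get
 * a relative value, then add the new queue's min_vruntime. Unsigned
 * arithmetic wraps, so the order of the two steps is safe either way.
 */
static void move_sleeping_entity(struct entity_sketch *se,
				 const struct group_queue *from,
				 const struct group_queue *to)
{
	se->vruntime -= from->min_vruntime;
	se->vruntime += to->min_vruntime;
}
```

A task that was 50 units ahead of its old queue's minimum stays 50 units
ahead of the new queue's minimum, whatever the absolute values are.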

Reported-by: Dima Zavin 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
Tested-by: Mike Galbraith 
LKML-Reference: <1287150604.29097.1513.camel@twins>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time

2011-03-31T18:58:00+00:00

Commit: b52bfee445d315549d41eacf2fa7c156e7d153d5 upstream

s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
the fine granularity accounting of user, system, hardirq, softirq times.
Adding that option on archs like x86 will be challenging however, given the
state of TSC reliability on various platforms and also the overhead it will
add in syscall entry exit.

Instead, add a lighter variant that only does finer accounting of
hardirq and softirq times, providing precise irq times (instead of timer tick
based samples). This accounting is added with a new config option
CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
interested in paying the perf penalty.

This accounting is based on sched_clock, with the code being generic.
So, other archs may find it useful as well.
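
A rough userspace sketch of clock-based irq accounting follows. The struct,
the function name and the boolean context arguments are assumptions made for
illustration; the timestamp passed in stands in for a sched_clock reading.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cpu accounting state; names are illustrative only. */
struct irq_time_state {
	uint64_t irq_start_time;   /* timestamp of the last context change */
	uint64_t hardirq_time;     /* accumulated hardirq time */
	uint64_t softirq_time;     /* accumulated softirq time */
};

/*
 * Called at every irq context transition with the current clock value and
 * the context that was running until now. The elapsed time is charged to
 * that context; time outside irq context is left for tick accounting.
 */
static void account_irq_delta(struct irq_time_state *st, uint64_t now,
			      bool in_hardirq, bool serving_softirq)
{
	uint64_t delta = now - st->irq_start_time;

	st->irq_start_time = now;
	if (in_hardirq)
		st->hardirq_time += delta;
	else if (serving_softirq)
		st->softirq_time += delta;
}
```

Because each transition is timestamped, the irq totals do not depend on
where the timer tick happens to land.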

This patch just adds the core logic and does not enable this logic yet.

Signed-off-by: Venkatesh Pallipadi 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Add a PF flag for ksoftirqd identification

2011-03-31T18:58:00+00:00

Commit: 6cdd5199daf0cb7b0fcc8dca941af08492612887 upstream

To account softirq time cleanly in the scheduler, we need to identify whether
softirq is invoked in ksoftirqd context or in softirq at hardirq tail context.
Add PF_KSOFTIRQD for that purpose.

As all PF flag bits are currently taken, create space by moving one of the
infrequently used bits (PF_THREAD_BOUND) down in task_struct to sit alongside
some other state fields.

Signed-off-by: Venkatesh Pallipadi 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Remove unused PF_ALIGNWARN flag

2011-03-31T18:58:00+00:00

Commit: 637bbdc5b83615ef9f45f50399d1c7f27473c713 upstream

PF_ALIGNWARN is not implemented and, according to its comment, is only
for the 486.

It is not likely someone will implement this flag feature.
So here remove this flag and leave the valuable 0x00000001 for
future use.

Signed-off-by: Dave Young 
Signed-off-by: Andi Kleen 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
LKML-Reference: <20100913121903.GB22238@darkstar>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Fix softirq time accounting

2011-03-31T18:57:59+00:00

Commit: 75e1056f5c57050415b64cb761a3acc35d91f013 upstream

Peter Zijlstra found a bug in the way softirq time is accounted in
VIRT_CPU_ACCOUNTING on this thread:

   http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

The problem is that softirq processing uses local_bh_disable internally. There
is no way, later in the flow, to tell whether softirq is being processed or
bh has merely been disabled. So a hardirq taken while bh is disabled results
in time being wrongly accounted as softirq.

Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
as well, because account_system_time() in normal tick based accounting also uses
softirq_count, which will be set even when not in softirq with bh disabled.

Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq count
for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET during softirq
processing. The patch below does that and adds the API in_serving_softirq(), which
returns whether we are currently processing softirq or not.
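
The counting trick can be modelled in userspace as below. The counter
variable and the *_sketch helpers are hypothetical, and the bit position of
the softirq field is an assumption chosen for the illustration.

```c
#include <stdbool.h>

/* Assumed layout: an 8-bit softirq field starting at bit 8. */
#define SOFTIRQ_SHIFT          8
#define SOFTIRQ_OFFSET         (1U << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK           (0xffU << SOFTIRQ_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET)

/* Stand-in for the per-task preempt counter. */
static unsigned int count_sketch;

/* Disabling bh adds an even multiple of SOFTIRQ_OFFSET ... */
static void bh_disable_sketch(void) { count_sketch += SOFTIRQ_DISABLE_OFFSET; }
static void bh_enable_sketch(void)  { count_sketch -= SOFTIRQ_DISABLE_OFFSET; }

/* ... while actually serving softirq adds exactly one SOFTIRQ_OFFSET. */
static void softirq_enter_sketch(void) { count_sketch += SOFTIRQ_OFFSET; }
static void softirq_exit_sketch(void)  { count_sketch -= SOFTIRQ_OFFSET; }

static unsigned int softirq_count_sketch(void)
{
	return count_sketch & SOFTIRQ_MASK;
}

/* True when serving softirq or when bh is merely disabled. */
static bool in_softirq_sketch(void)
{
	return softirq_count_sketch() != 0;
}

/* True only when serving softirq: the lowest bit of the field is set. */
static bool in_serving_softirq_sketch(void)
{
	return (softirq_count_sketch() & SOFTIRQ_OFFSET) != 0;
}
```

Since bh disabling only ever contributes even multiples, the lowest bit of
the field stays a reliable "serving softirq" marker even when the two nest.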

Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
to in_serving_softirq.

Looks like many usages of in_softirq really want in_serving_softirq. Those
changes can be made individually on a case by case basis.

Signed-off-by: Venkatesh Pallipadi 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar 
Signed-off-by: Mike Galbraith 
Acked-by: Peter Zijlstra 
Signed-off-by: Greg Kroah-Hartman

sched: Cure more NO_HZ load average woes

2011-02-06T19:03:41+00:00

commit 0f004f5a696a9434b7214d0d3cbd0525ee77d428 upstream.

There's a long-running regression that proved difficult to fix and
which is hitting certain people and is rather annoying in its effects.

Damien reported that after 74f5187ac8 (sched: Cure load average vs
NO_HZ woes) his load average is unnaturally high. He also noted that
even with that patch reverted the load average numbers are not
correct.

The problem is that the previous patch only solved half the NO_HZ
problem: it addressed the part of going into NO_HZ mode, not of
coming out of NO_HZ mode. This patch implements that missing half.

When coming out of NO_HZ mode there are two important things to take
care of:

 - Folding the pending idle delta into the global active count.
 - Correctly aging the averages for the idle-duration.

So with this patch the NO_HZ interaction should be complete and
behaviour between CONFIG_NO_HZ=[yn] should be equivalent.

Furthermore, this patch slightly changes the load average computation
by adding a rounding term to the fixed point multiplication.
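
The effect of a rounding term on a fixed-point decay step can be sketched
as follows. The constants follow the usual 11-bit fixed-point load average
definitions; the two function names are hypothetical, and the truncating
variant is included only for comparison.

```c
#define FSHIFT   11                /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed point */
#define EXP_1    1884UL            /* 1/exp(5sec/1min) in fixed point */

/* Decay step that truncates the fractional part. */
static unsigned long calc_load_truncating(unsigned long load,
					  unsigned long exp,
					  unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

/* Same step with half of the last kept bit added before shifting. */
static unsigned long calc_load_rounding(unsigned long load,
					unsigned long exp,
					unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	load += 1UL << (FSHIFT - 1);
	return load >> FSHIFT;
}
```

With a tiny residual load of 1 and no active tasks, the truncating step
drops straight to 0 while the rounding step keeps 1; at steady state
(load and active both FIXED_1) both return FIXED_1.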

Reported-by: Damien Wyart 
Reported-by: Tim McGrath 
Tested-by: Damien Wyart 
Tested-by: Orion Poplawski 
Tested-by: Kyle McMartin 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Andi Kleen 
Cc: Chase Douglas 
LKML-Reference: <1291129145.32004.874.camel@laptop>
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched: Revert nohz_ratelimit() for now

2010-08-13T20:31:02+00:00

commit 396e894d289d69bacf5acd983c97cd6e21a14c08 upstream.

Norbert reported that nohz_ratelimit() causes his laptop to burn about
4W (40%) extra. For now back out the change and see if we can adjust
the power management code to make better decisions.

Reported-by: Norbert Preining 
Signed-off-by: Peter Zijlstra 
Acked-by: Mike Galbraith 
Cc: Arjan van de Ven 
LKML-Reference: 
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

CRED: Fix __task_cred()'s lockdep check and banner comment

2010-07-29T22:16:18+00:00

Fix __task_cred()'s lockdep check by removing the following validation
condition:

	lockdep_tasklist_lock_is_held()

as commit_creds() does not take the tasklist_lock, and nor do most of the
functions that call it, so this check is pointless and it can prevent
detection of the RCU lock not being held if the tasklist_lock is held.

Instead, add the following validation condition:

	task->exit_state >= 0

to permit the access if the target task is dead and therefore unable to change
its own credentials.

Fix __task_cred()'s comment to:

 (1) discard the bit that says that the caller must prevent the target task
     from being deleted.  That shouldn't need saying.

 (2) Add a comment indicating the result of __task_cred() should not be passed
     directly to get_cred(), but rather that get_task_cred() should be used
     instead.

Also put a note into the documentation to enforce this point there too.

Signed-off-by: David Howells 
Acked-by: Jiri Olsa 
Cc: Paul E. McKenney 
Signed-off-by: Linus Torvalds

sched: Cure nr_iowait_cpu() users

2010-07-01T07:39:48+00:00

Commit 0224cf4c5e (sched: Intoduce get_cpu_iowait_time_us())
broke things by not making sure preemption was indeed disabled
by the callers of nr_iowait_cpu() which took the iowait value of
the current cpu.

This resulted in a heap of preempt warnings. Cure this by making
nr_iowait_cpu() take a cpu number and fix up the callers to pass
in the right number.
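
The shape of the interface change can be illustrated with a small userspace
model. The array, its size and the function name are hypothetical; the point
is only that the caller names the cpu explicitly, so the result no longer
depends on which cpu the caller happens to be running on.

```c
#define NR_CPUS_SKETCH 4

/* Hypothetical per-cpu iowait counters, indexed by cpu number. */
static unsigned long iowait_count_sketch[NR_CPUS_SKETCH];

/*
 * Return the iowait count of the given cpu. Out-of-range cpu numbers
 * yield 0 in this sketch rather than reading past the array.
 */
static unsigned long nr_iowait_cpu_sketch(int cpu)
{
	if (cpu < 0 || cpu >= NR_CPUS_SKETCH)
		return 0;
	return iowait_count_sketch[cpu];
}
```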

Signed-off-by: Peter Zijlstra 
Cc: Arjan van de Ven 
Cc: Sergey Senozhatsky 
Cc: Rafael J. Wysocki 
Cc: Maxim Levitsky 
Cc: Len Brown 
Cc: Pavel Machek 
Cc: Jiri Slaby 
Cc: linux-pm@lists.linux-foundation.org
LKML-Reference: <1277968037.1868.120.camel@laptop>
Signed-off-by: Ingo Molnar