<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/kernel/sched.c, branch linux-2.6.24.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>NOHZ: reevaluate idle sleep length after add_timer_on()</title>
<updated>2008-04-19T01:53:20+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2008-03-26T18:35:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e8f696e9daa00e52b9c7ad1822fcda354d0baabd'/>
<id>e8f696e9daa00e52b9c7ad1822fcda354d0baabd</id>
<content type='text'>
upstream commit: 06d8308c61e54346585b2691c13ee3f90cb6fb2f

add_timer_on() can add a timer on a CPU which is currently in a long
idle sleep, but the timer wheel is not reevaluated by the nohz code on
that CPU. So a timer can be delayed for quite a long time. This
triggered a false positive in the clocksource watchdog code.

To avoid this we need to wake up the idle CPU and enforce the
reevaluation of the timer wheel for the next timer event.

Add a function, which checks a given CPU for idle state, marks the
idle task with NEED_RESCHED and sends a reschedule IPI to notify the
other CPU of the change in the timer wheel.

Call this function from add_timer_on().

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Acked-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Chris Wright &lt;chrisw@sous-sol.org&gt;
--
 include/linux/sched.h |    6 ++++++
 kernel/sched.c        |   43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/timer.c        |   10 +++++++++-
 3 files changed, 58 insertions(+), 1 deletion(-)
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
upstream commit: 06d8308c61e54346585b2691c13ee3f90cb6fb2f

add_timer_on() can add a timer on a CPU which is currently in a long
idle sleep, but the timer wheel is not reevaluated by the nohz code on
that CPU. So a timer can be delayed for quite a long time. This
triggered a false positive in the clocksource watchdog code.

To avoid this we need to wake up the idle CPU and enforce the
reevaluation of the timer wheel for the next timer event.

Add a function, which checks a given CPU for idle state, marks the
idle task with NEED_RESCHED and sends a reschedule IPI to notify the
other CPU of the change in the timer wheel.

Call this function from add_timer_on().

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Acked-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Chris Wright &lt;chrisw@sous-sol.org&gt;
--
 include/linux/sched.h |    6 ++++++
 kernel/sched.c        |   43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/timer.c        |   10 +++++++++-
 3 files changed, 58 insertions(+), 1 deletion(-)
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: fix race in schedule()</title>
<updated>2008-03-24T18:47:43+00:00</updated>
<author>
<name>Hiroshi Shimamoto</name>
<email>h-shimamoto@ct.jp.nec.com</email>
</author>
<published>2008-03-10T18:01:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ccab3340fa6f495c1932fad84163e3fab40094e1'/>
<id>ccab3340fa6f495c1932fad84163e3fab40094e1</id>
<content type='text'>
Fix a hard to trigger crash seen in the -rt kernel that also affects
the vanilla scheduler.

There is a race condition between schedule() and some dequeue/enqueue
functions; rt_mutex_setprio(), __setscheduler() and sched_move_task().

When scheduling to idle, idle_balance() is called to pull tasks from
other busy processor. It might drop the rq lock. It means that those 3
functions encounter on_rq=0 and running=1. The current task should be
put when running.

Here is a possible scenario:

   CPU0                               CPU1
    |                              schedule()
    |                              -&gt;deactivate_task()
    |                              -&gt;idle_balance()
    |                              --&gt;load_balance_newidle()
rt_mutex_setprio()                     |
    |                              ---&gt;double_lock_balance()
    *get lock                          *rel lock
    * on_rq=0, ruuning=1               |
    * sched_class is changed           |
    *rel lock                          *get lock
    :                                  |
                                       :
                                   -&gt;put_prev_task_rt()
                                   -&gt;pick_next_task_fair()
                                       =&gt; panic

The current process of CPU1(P1) is scheduling. Deactivated P1, and the
scheduler looks for another process on other CPU's runqueue because CPU1
will be idle. idle_balance(), load_balance_newidle() and
double_lock_balance() are called and double_lock_balance() could drop
the rq lock. On the other hand, CPU0 is trying to boost the priority of
P1. The result of boosting only P1's prio and sched_class are changed to
RT. The sched entities of P1 and P1's group are never put. It makes
cfs_rq invalid, because the cfs_rq has curr and no leaf, but
pick_next_task_fair() is called, then the kernel panics.

Signed-off-by: Hiroshi Shimamoto &lt;h-shimamoto@ct.jp.nec.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
[chrisw@sous-sol.org: backport to 2.6.24.3]
Signed-off-by: Chris Wright &lt;chrisw@sous-sol.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Fix a hard to trigger crash seen in the -rt kernel that also affects
the vanilla scheduler.

There is a race condition between schedule() and some dequeue/enqueue
functions; rt_mutex_setprio(), __setscheduler() and sched_move_task().

When scheduling to idle, idle_balance() is called to pull tasks from
other busy processor. It might drop the rq lock. It means that those 3
functions encounter on_rq=0 and running=1. The current task should be
put when running.

Here is a possible scenario:

   CPU0                               CPU1
    |                              schedule()
    |                              -&gt;deactivate_task()
    |                              -&gt;idle_balance()
    |                              --&gt;load_balance_newidle()
rt_mutex_setprio()                     |
    |                              ---&gt;double_lock_balance()
    *get lock                          *rel lock
    * on_rq=0, ruuning=1               |
    * sched_class is changed           |
    *rel lock                          *get lock
    :                                  |
                                       :
                                   -&gt;put_prev_task_rt()
                                   -&gt;pick_next_task_fair()
                                       =&gt; panic

The current process of CPU1(P1) is scheduling. Deactivated P1, and the
scheduler looks for another process on other CPU's runqueue because CPU1
will be idle. idle_balance(), load_balance_newidle() and
double_lock_balance() are called and double_lock_balance() could drop
the rq lock. On the other hand, CPU0 is trying to boost the priority of
P1. The result of boosting only P1's prio and sched_class are changed to
RT. The sched entities of P1 and P1's group are never put. It makes
cfs_rq invalid, because the cfs_rq has curr and no leaf, but
pick_next_task_fair() is called, then the kernel panics.

Signed-off-by: Hiroshi Shimamoto &lt;h-shimamoto@ct.jp.nec.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
[chrisw@sous-sol.org: backport to 2.6.24.3]
Signed-off-by: Chris Wright &lt;chrisw@sous-sol.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: group scheduler, set uid share fix</title>
<updated>2008-01-22T10:24:58+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@elte.hu</email>
</author>
<published>2008-01-22T10:24:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c61935fd0e7f087a643827b4bf5ef646963c10fa'/>
<id>c61935fd0e7f087a643827b4bf5ef646963c10fa</id>
<content type='text'>
setting cpu share to 1 causes hangs, as reported in:

    http://bugzilla.kernel.org/show_bug.cgi?id=9779

as the default share is 1024, the values of 0 and 1 can indeed
cause problems. Limit it to 2 or higher values.

These values can only be set by the root user - but still it
makes sense to protect against nonsensical values.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
setting cpu share to 1 causes hangs, as reported in:

    http://bugzilla.kernel.org/show_bug.cgi?id=9779

as the default share is 1024, the values of 0 and 1 can indeed
cause problems. Limit it to 2 or higher values.

These values can only be set by the root user - but still it
makes sense to protect against nonsensical values.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>show_task: real_parent</title>
<updated>2008-01-09T16:03:58+00:00</updated>
<author>
<name>Roland McGrath</name>
<email>roland@redhat.com</email>
</author>
<published>2008-01-09T08:03:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=fcfd50afb6e94c8cf121ca4e7e3e7166bae7c6aa'/>
<id>fcfd50afb6e94c8cf121ca4e7e3e7166bae7c6aa</id>
<content type='text'>
The show_task function invoked by sysrq-t et al displays the
pid and parent's pid of each task.  It seems more useful to
show the actual process hierarchy here than who is using
ptrace on each process.

Signed-off-by: Roland McGrath &lt;roland@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The show_task function invoked by sysrq-t et al displays the
pid and parent's pid of each task.  It seems more useful to
show the actual process hierarchy here than who is using
ptrace on each process.

Signed-off-by: Roland McGrath &lt;roland@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: touch softlockup watchdog after idling</title>
<updated>2007-12-18T14:21:13+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@elte.hu</email>
</author>
<published>2007-12-18T14:21:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2bacec8c318ca0418c0ee9ac662ee44207765dd4'/>
<id>2bacec8c318ca0418c0ee9ac662ee44207765dd4</id>
<content type='text'>
touch softlockup watchdog after idling.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
touch softlockup watchdog after idling.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: fix crash on ia64, introduce task_current()</title>
<updated>2007-12-18T14:21:13+00:00</updated>
<author>
<name>Dmitry Adamushko</name>
<email>dmitry.adamushko@gmail.com</email>
</author>
<published>2007-12-18T14:21:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=051a1d1afa47206e23ae03f781c6795ce870e3d5'/>
<id>051a1d1afa47206e23ae03f781c6795ce870e3d5</id>
<content type='text'>
Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and
sched_move_task()) must handle a given task differently in case it's the
'rq-&gt;curr' task on its run-queue. The task_running() interface is not
suitable for determining such tasks for platforms with one of the
following options:

#define __ARCH_WANT_UNLOCKED_CTXSW
#define __ARCH_WANT_INTERRUPTS_ON_CTXSW

Due to the fact that it makes use of 'p-&gt;oncpu == 1' as a criterion but
such a task is not necessarily 'rq-&gt;curr'.

The detailed explanation is available here:
https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.html

Signed-off-by: Dmitry Adamushko &lt;dmitry.adamushko@gmail.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Tested-by: Dhaval Giani &lt;dhaval@linux.vnet.ibm.com&gt;
Tested-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and
sched_move_task()) must handle a given task differently in case it's the
'rq-&gt;curr' task on its run-queue. The task_running() interface is not
suitable for determining such tasks for platforms with one of the
following options:

#define __ARCH_WANT_UNLOCKED_CTXSW
#define __ARCH_WANT_INTERRUPTS_ON_CTXSW

Due to the fact that it makes use of 'p-&gt;oncpu == 1' as a criterion but
such a task is not necessarily 'rq-&gt;curr'.

The detailed explanation is available here:
https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.html

Signed-off-by: Dmitry Adamushko &lt;dmitry.adamushko@gmail.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Tested-by: Dhaval Giani &lt;dhaval@linux.vnet.ibm.com&gt;
Tested-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: enable early use of sched_clock()</title>
<updated>2007-12-07T18:02:47+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@elte.hu</email>
</author>
<published>2007-12-07T18:02:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8ced5f69e4bc09adcc6442e090e2e64c197246cf'/>
<id>8ced5f69e4bc09adcc6442e090e2e64c197246cf</id>
<content type='text'>
some platforms have sched_clock() implementations that cannot be called
very early during wakeup. If it's called it might hang or crash in hard
to debug ways. So only call update_rq_clock() [which calls sched_clock()]
if sched_init() has already been called. (rq-&gt;idle is NULL before the
scheduler is initialized.)

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
some platforms have sched_clock() implementations that cannot be called
very early during wakeup. If it's called it might hang or crash in hard
to debug ways. So only call update_rq_clock() [which calls sched_clock()]
if sched_init() has already been called. (rq-&gt;idle is NULL before the
scheduler is initialized.)

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: style cleanups</title>
<updated>2007-12-05T14:46:09+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@elte.hu</email>
</author>
<published>2007-12-05T14:46:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=41a2d6cfa3f77ec469e7e5f06b4d7ffd031f9c0e'/>
<id>41a2d6cfa3f77ec469e7e5f06b4d7ffd031f9c0e</id>
<content type='text'>
style cleanup of various changes that were done recently.

no code changed:

      text    data     bss     dec     hex filename
     23680    2542      28   26250    668a sched.o.before
     23680    2542      28   26250    668a sched.o.after

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
style cleanup of various changes that were done recently.

no code changed:

      text    data     bss     dec     hex filename
     23680    2542      28   26250    668a sched.o.before
     23680    2542      28   26250    668a sched.o.after

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: fix crash in sys_sched_rr_get_interval()</title>
<updated>2007-12-04T16:04:39+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@elte.hu</email>
</author>
<published>2007-12-04T16:04:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=77034937dc4575ca0a76bf209838ecd39e804089'/>
<id>77034937dc4575ca0a76bf209838ecd39e804089</id>
<content type='text'>
Luiz Fernando N. Capitulino reported that sched_rr_get_interval()
crashes for SCHED_OTHER tasks that are on an idle runqueue.

The fix is to return a 0 timeslice for tasks that are on an idle
runqueue. (and which are not running, obviously)

this also shrinks the code a bit:

   text    data     bss     dec     hex filename
  47903    3934     336   52173    cbcd sched.o.before
  47885    3934     336   52155    cbbb sched.o.after

Reported-by: Luiz Fernando N. Capitulino &lt;lcapitulino@mandriva.com.br&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Luiz Fernando N. Capitulino reported that sched_rr_get_interval()
crashes for SCHED_OTHER tasks that are on an idle runqueue.

The fix is to return a 0 timeslice for tasks that are on an idle
runqueue. (and which are not running, obviously)

this also shrinks the code a bit:

   text    data     bss     dec     hex filename
  47903    3934     336   52173    cbcd sched.o.before
  47885    3934     336   52155    cbbb sched.o.after

Reported-by: Luiz Fernando N. Capitulino &lt;lcapitulino@mandriva.com.br&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: cpu accounting controller (V2)</title>
<updated>2007-12-02T19:04:49+00:00</updated>
<author>
<name>Srivatsa Vaddagiri</name>
<email>vatsa@linux.vnet.ibm.com</email>
</author>
<published>2007-12-02T19:04:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d842de871c8c5e2110c7e4f3f29bbe7b1a519ab8'/>
<id>d842de871c8c5e2110c7e4f3f29bbe7b1a519ab8</id>
<content type='text'>
Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature for
us, which provided a cpu accounting resource controller.  This feature would be
useful if someone wants to group tasks only for accounting purpose and doesnt
really want to exercise any control over their cpu consumption.

The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
these differences:

        - Removed load average information. I felt it needs more thought (esp
	  to deal with SMP and virtualized platforms) and can be added for
	  2.6.25 after more discussions.
        - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
	  stats are) and invoke cpuacct_charge() from the respective scheduler
	  classes
	- Make accounting scalable on SMP systems by splitting the usage
	  counter to be per-cpu
	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
	  code is not big enough to warrant a new file and also this rightly
	  needs to live inside the scheduler. Also things like accessing
	  rq-&gt;lock while reading cpu usage becomes easier if the code lived in
	  kernel/sched.c)

The patch also modifies the cpu controller not to provide the same accounting
information.

Tested-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;

 Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
 some simple tests like cpuspin (spin on the cpu), ran several tasks in
 the same group and timed them. Compared their time stamps with
 cpuacct.usage.

Signed-off-by: Srivatsa Vaddagiri &lt;vatsa@linux.vnet.ibm.com&gt;
Signed-off-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature for
us, which provided a cpu accounting resource controller.  This feature would be
useful if someone wants to group tasks only for accounting purpose and doesnt
really want to exercise any control over their cpu consumption.

The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
these differences:

        - Removed load average information. I felt it needs more thought (esp
	  to deal with SMP and virtualized platforms) and can be added for
	  2.6.25 after more discussions.
        - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
	  stats are) and invoke cpuacct_charge() from the respective scheduler
	  classes
	- Make accounting scalable on SMP systems by splitting the usage
	  counter to be per-cpu
	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
	  code is not big enough to warrant a new file and also this rightly
	  needs to live inside the scheduler. Also things like accessing
	  rq-&gt;lock while reading cpu usage becomes easier if the code lived in
	  kernel/sched.c)

The patch also modifies the cpu controller not to provide the same accounting
information.

Tested-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;

 Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
 some simple tests like cpuspin (spin on the cpu), ran several tasks in
 the same group and timed them. Compared their time stamps with
 cpuacct.usage.

Signed-off-by: Srivatsa Vaddagiri &lt;vatsa@linux.vnet.ibm.com&gt;
Signed-off-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
</pre>
</div>
</content>
</entry>
</feed>
