<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/kernel/sched/fair.c, branch v3.19</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>sched/fair: Avoid using uninitialized variable in preferred_group_nid()</title>
<updated>2015-01-28T12:14:12+00:00</updated>
<author>
<name>Jan Beulich</name>
<email>JBeulich@suse.com</email>
</author>
<published>2015-01-23T08:25:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=81907478c4311a679849216abf723999184ab984'/>
<id>81907478c4311a679849216abf723999184ab984</id>
<content type='text'>
At least some gcc versions - validly afaict - warn about potentially
using max_group uninitialized: There's no way the compiler can prove
that the body of the conditional where it and max_faults get set/
updated gets executed; in fact, without knowing all the details of
other scheduler code, I can't prove this either.

Generally the necessary change would appear to be to clear max_group
prior to entering the inner loop, and break out of the outer loop when
it ends up being all clear after the inner one. This, however, seems
inefficient, and afaict the same effect can be achieved by exiting the
outer loop when max_faults is still zero after the inner loop.

[ mingo: changed the solution to zero initialization: uninitialized_var()
  needs to die, as it's an actively dangerous construct: if in the future
  a known-proven-good piece of code is changed to have a true, buggy
  uninitialized variable, the compiler warning is then supressed...

  The better long term solution is to clean up the code flow, so that
  even simple minded compilers (and humans!) are able to read it without
  getting a headache.  ]

Signed-off-by: Jan Beulich &lt;jbeulich@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
At least some gcc versions - validly afaict - warn about potentially
using max_group uninitialized: There's no way the compiler can prove
that the body of the conditional where it and max_faults get set/
updated gets executed; in fact, without knowing all the details of
other scheduler code, I can't prove this either.

Generally the necessary change would appear to be to clear max_group
prior to entering the inner loop, and break out of the outer loop when
it ends up being all clear after the inner one. This, however, seems
inefficient, and afaict the same effect can be achieved by exiting the
outer loop when max_faults is still zero after the inner loop.

[ mingo: changed the solution to zero initialization: uninitialized_var()
  needs to die, as it's an actively dangerous construct: if in the future
  a known-proven-good piece of code is changed to have a true, buggy
  uninitialized variable, the compiler warning is then supressed...

  The better long term solution is to clean up the code flow, so that
  even simple minded compilers (and humans!) are able to read it without
  getting a headache.  ]

Signed-off-by: Jan Beulich &lt;jbeulich@suse.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()</title>
<updated>2015-01-09T10:19:00+00:00</updated>
<author>
<name>Tetsuo Handa</name>
<email>penguin-kernel@I-love.SAKURA.ne.jp</email>
</author>
<published>2014-12-25T06:51:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7f1a169b88f513e32a432ca0f85bfd282d117bd6'/>
<id>7f1a169b88f513e32a432ca0f85bfd282d117bd6</id>
<content type='text'>
When alloc_fair_sched_group() in sched_create_group() fails,
free_sched_group() is called, and free_fair_sched_group() is called by
free_sched_group(). Since destroy_cfs_bandwidth() is called by
free_fair_sched_group() without calling init_cfs_bandwidth(),
RCU stall occurs at hrtimer_cancel():

  INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies g=13074 c=13073 q=0)
  Task dump for CPU 1:
  (fprintd)       R  running task        0  6249      1 0x00000088
  ...
  Call Trace:
   &lt;IRQ&gt;  [&lt;ffffffff81094988&gt;] sched_show_task+0xa8/0x110
   [&lt;ffffffff81097acd&gt;] dump_cpu_task+0x3d/0x50
   [&lt;ffffffff810c3a80&gt;] rcu_dump_cpu_stacks+0x90/0xd0
   [&lt;ffffffff810c7751&gt;] rcu_check_callbacks+0x491/0x700
   [&lt;ffffffff810cbf2b&gt;] update_process_times+0x4b/0x80
   [&lt;ffffffff810db046&gt;] tick_sched_handle.isra.20+0x36/0x50
   [&lt;ffffffff810db0a2&gt;] tick_sched_timer+0x42/0x70
   [&lt;ffffffff810ccb19&gt;] __run_hrtimer+0x69/0x1a0
   [&lt;ffffffff810db060&gt;] ? tick_sched_handle.isra.20+0x50/0x50
   [&lt;ffffffff810ccedf&gt;] hrtimer_interrupt+0xef/0x230
   [&lt;ffffffff810452cb&gt;] local_apic_timer_interrupt+0x3b/0x70
   [&lt;ffffffff8164a465&gt;] smp_apic_timer_interrupt+0x45/0x60
   [&lt;ffffffff816485bd&gt;] apic_timer_interrupt+0x6d/0x80
   &lt;EOI&gt;  [&lt;ffffffff810cc588&gt;] ? lock_hrtimer_base.isra.23+0x18/0x50
   [&lt;ffffffff81193cf1&gt;] ? __kmalloc+0x211/0x230
   [&lt;ffffffff810cc9d2&gt;] hrtimer_try_to_cancel+0x22/0xd0
   [&lt;ffffffff81193cf1&gt;] ? __kmalloc+0x211/0x230
   [&lt;ffffffff810ccaa2&gt;] hrtimer_cancel+0x22/0x30
   [&lt;ffffffff810a3cb5&gt;] free_fair_sched_group+0x25/0xd0
   [&lt;ffffffff8108df46&gt;] free_sched_group+0x16/0x40
   [&lt;ffffffff810971bb&gt;] sched_create_group+0x4b/0x80
   [&lt;ffffffff810aa383&gt;] sched_autogroup_create_attach+0x43/0x1c0
   [&lt;ffffffff8107dc9c&gt;] sys_setsid+0x7c/0x110
   [&lt;ffffffff81647729&gt;] system_call_fastpath+0x12/0x17

Check whether init_cfs_bandwidth() was called before calling
destroy_cfs_bandwidth().

Signed-off-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
[ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Paul Turner &lt;pjt@google.com&gt;
Cc: Ben Segall &lt;bsegall@google.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When alloc_fair_sched_group() in sched_create_group() fails,
free_sched_group() is called, and free_fair_sched_group() is called by
free_sched_group(). Since destroy_cfs_bandwidth() is called by
free_fair_sched_group() without calling init_cfs_bandwidth(),
RCU stall occurs at hrtimer_cancel():

  INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies g=13074 c=13073 q=0)
  Task dump for CPU 1:
  (fprintd)       R  running task        0  6249      1 0x00000088
  ...
  Call Trace:
   &lt;IRQ&gt;  [&lt;ffffffff81094988&gt;] sched_show_task+0xa8/0x110
   [&lt;ffffffff81097acd&gt;] dump_cpu_task+0x3d/0x50
   [&lt;ffffffff810c3a80&gt;] rcu_dump_cpu_stacks+0x90/0xd0
   [&lt;ffffffff810c7751&gt;] rcu_check_callbacks+0x491/0x700
   [&lt;ffffffff810cbf2b&gt;] update_process_times+0x4b/0x80
   [&lt;ffffffff810db046&gt;] tick_sched_handle.isra.20+0x36/0x50
   [&lt;ffffffff810db0a2&gt;] tick_sched_timer+0x42/0x70
   [&lt;ffffffff810ccb19&gt;] __run_hrtimer+0x69/0x1a0
   [&lt;ffffffff810db060&gt;] ? tick_sched_handle.isra.20+0x50/0x50
   [&lt;ffffffff810ccedf&gt;] hrtimer_interrupt+0xef/0x230
   [&lt;ffffffff810452cb&gt;] local_apic_timer_interrupt+0x3b/0x70
   [&lt;ffffffff8164a465&gt;] smp_apic_timer_interrupt+0x45/0x60
   [&lt;ffffffff816485bd&gt;] apic_timer_interrupt+0x6d/0x80
   &lt;EOI&gt;  [&lt;ffffffff810cc588&gt;] ? lock_hrtimer_base.isra.23+0x18/0x50
   [&lt;ffffffff81193cf1&gt;] ? __kmalloc+0x211/0x230
   [&lt;ffffffff810cc9d2&gt;] hrtimer_try_to_cancel+0x22/0xd0
   [&lt;ffffffff81193cf1&gt;] ? __kmalloc+0x211/0x230
   [&lt;ffffffff810ccaa2&gt;] hrtimer_cancel+0x22/0x30
   [&lt;ffffffff810a3cb5&gt;] free_fair_sched_group+0x25/0xd0
   [&lt;ffffffff8108df46&gt;] free_sched_group+0x16/0x40
   [&lt;ffffffff810971bb&gt;] sched_create_group+0x4b/0x80
   [&lt;ffffffff810aa383&gt;] sched_autogroup_create_attach+0x43/0x1c0
   [&lt;ffffffff8107dc9c&gt;] sys_setsid+0x7c/0x110
   [&lt;ffffffff81647729&gt;] system_call_fastpath+0x12/0x17

Check whether init_cfs_bandwidth() was called before calling
destroy_cfs_bandwidth().

Signed-off-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
[ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Paul Turner &lt;pjt@google.com&gt;
Cc: Ben Segall &lt;bsegall@google.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: Fix odd values in effective_load() calculations</title>
<updated>2015-01-09T10:18:54+00:00</updated>
<author>
<name>Yuyang Du</name>
<email>yuyang.du@intel.com</email>
</author>
<published>2014-12-19T00:29:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=32a8df4e0b33fccc9715213b382160415b5c4008'/>
<id>32a8df4e0b33fccc9715213b382160415b5c4008</id>
<content type='text'>
In effective_load, we have (long w * unsigned long tg-&gt;shares) / long W,
when w is negative, it is cast to unsigned long and hence the product is
insanely large. Fix this by casting tg-&gt;shares to long.

Reported-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Yuyang Du &lt;yuyang.du@intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Dave Jones &lt;davej@redhat.com&gt;
Cc: Andrey Ryabinin &lt;a.ryabinin@samsung.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In effective_load, we have (long w * unsigned long tg-&gt;shares) / long W,
when w is negative, it is cast to unsigned long and hence the product is
insanely large. Fix this by casting tg-&gt;shares to long.

Reported-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Yuyang Du &lt;yuyang.du@intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Dave Jones &lt;davej@redhat.com&gt;
Cc: Andrey Ryabinin &lt;a.ryabinin@samsung.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/fair: Fix stale overloaded status in the busiest group finding logic</title>
<updated>2014-11-16T09:58:56+00:00</updated>
<author>
<name>Wanpeng Li</name>
<email>wanpeng.li@linux.intel.com</email>
</author>
<published>2014-11-04T23:44:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=cb0b9f2445cdf9893352e4548582a2892af7137c'/>
<id>cb0b9f2445cdf9893352e4548582a2892af7137c</id>
<content type='text'>
Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return
'true' on a busier sd") changes groups to be ranked in the order of
overloaded &gt; imbalance &gt; other, and busiest group is picked according
to this order.

sgs-&gt;group_capacity_factor is used to check if the group is overloaded.

When the child domain prefers tasks to go to siblings first, the
sgs-&gt;group_capacity_factor will be set lower than one in order to
move all the excess tasks away.

However, group overloaded status is not updated when
sgs-&gt;group_capacity_factor is set to lower than one, which leads to us
missing to find the busiest group.

This patch fixes it by updating group overloaded status when sg capacity
factor is set to one, in order to find the busiest group accurately.

Signed-off-by: Wanpeng Li &lt;wanpeng.li@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Cc: Kirill Tkhai &lt;ktkhai@parallels.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
[ Fixed the changelog. ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit caeb178c60f4 ("sched/fair: Make update_sd_pick_busiest() return
'true' on a busier sd") changes groups to be ranked in the order of
overloaded &gt; imbalance &gt; other, and busiest group is picked according
to this order.

sgs-&gt;group_capacity_factor is used to check if the group is overloaded.

When the child domain prefers tasks to go to siblings first, the
sgs-&gt;group_capacity_factor will be set lower than one in order to
move all the excess tasks away.

However, group overloaded status is not updated when
sgs-&gt;group_capacity_factor is set to lower than one, which leads to us
missing to find the busiest group.

This patch fixes it by updating group overloaded status when sg capacity
factor is set to one, in order to find the busiest group accurately.

Signed-off-by: Wanpeng Li &lt;wanpeng.li@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Cc: Kirill Tkhai &lt;ktkhai@parallels.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
[ Fixed the changelog. ]
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: Move p-&gt;nr_cpus_allowed check to select_task_rq()</title>
<updated>2014-11-16T09:58:55+00:00</updated>
<author>
<name>Wanpeng Li</name>
<email>wanpeng.li@linux.intel.com</email>
</author>
<published>2014-11-05T01:14:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=6c1d9410f007a26d13173cf17204cfd965f49b83'/>
<id>6c1d9410f007a26d13173cf17204cfd965f49b83</id>
<content type='text'>
Move the p-&gt;nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.

Suggested-and-Acked-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Wanpeng Li &lt;wanpeng.li@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: "pang.xunlei" &lt;pang.xunlei@linaro.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Move the p-&gt;nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.

Suggested-and-Acked-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Wanpeng Li &lt;wanpeng.li@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: "pang.xunlei" &lt;pang.xunlei@linaro.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/fair: Kill task_struct::numa_entry and numa_group::task_list</title>
<updated>2014-11-16T09:58:48+00:00</updated>
<author>
<name>Kirill Tkhai</name>
<email>ktkhai@parallels.com</email>
</author>
<published>2014-11-07T11:07:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=753899183c53aa609375b214ea8e040da89119c3'/>
<id>753899183c53aa609375b214ea8e040da89119c3</id>
<content type='text'>
Nobody iterates over numa_group::task_list, this just confuses the readers.

Signed-off-by: Kirill Tkhai &lt;ktkhai@parallels.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Nobody iterates over numa_group::task_list, this just confuses the readers.

Signed-off-by: Kirill Tkhai &lt;ktkhai@parallels.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhai
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes</title>
<updated>2014-11-16T09:50:25+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2014-11-16T09:50:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e9ac5f0fa8549dffe2a15870217a9c2e7cd557ec'/>
<id>e9ac5f0fa8549dffe2a15870217a9c2e7cd557ec</id>
<content type='text'>
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency</title>
<updated>2014-11-16T09:04:20+00:00</updated>
<author>
<name>Stanislaw Gruszka</name>
<email>sgruszka@redhat.com</email>
</author>
<published>2014-11-12T15:58:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=6e998916dfe327e785e7c2447959b2c1a3ea4930'/>
<id>6e998916dfe327e785e7c2447959b2c1a3ea4930</id>
<content type='text'>
Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&amp;Y) can result
of Y time being smaller than X time.

Reproducer/tester can be found further below, it can be compiled and ran by:

	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
	while ./tst-cpuclock2 ; do : ; done

This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".

Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime to running cputimer again on scheduler
tick, making it's sum_exec_runtime bigger than actual threads runtime.

KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes different approach to cure the problem. It calls
update_curr() when cputimer starts, that assure we will have updated
stats of running threads and on the next schedule tick we will account
only the runtime that elapsed from cputimer start. That also assure we
have consistent state between cpu times of individual threads and cpu
time of the process consisted by those threads.

Full reproducer (tst-cpuclock2.c):

	#define _GNU_SOURCE
	#include &lt;unistd.h&gt;
	#include &lt;sys/syscall.h&gt;
	#include &lt;stdio.h&gt;
	#include &lt;time.h&gt;
	#include &lt;pthread.h&gt;
	#include &lt;stdint.h&gt;
	#include &lt;inttypes.h&gt;

	/* Parameters for the Linux kernel ABI for CPU clocks.  */
	#define CPUCLOCK_SCHED          2
	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
		((~(clockid_t) (pid) &lt;&lt; 3) | (clockid_t) (clock))

	static pthread_barrier_t barrier;

	/* Help advance the clock.  */
	static void *chew_cpu(void *arg)
	{
		pthread_barrier_wait(&amp;barrier);
		while (1) ;

		return NULL;
	}

	/* Don't use the glibc wrapper.  */
	static int do_nanosleep(int flags, const struct timespec *req)
	{
		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
	}

	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
	{
		int64_t before_i = before-&gt;tv_sec * 1000000000ULL + before-&gt;tv_nsec;
		int64_t after_i = after-&gt;tv_sec * 1000000000ULL + after-&gt;tv_nsec;

		return after_i - before_i;
	}

	int main(void)
	{
		int result = 0;
		pthread_t th;

		pthread_barrier_init(&amp;barrier, NULL, 2);

		if (pthread_create(&amp;th, NULL, chew_cpu, NULL) != 0) {
			perror("pthread_create");
			return 1;
		}

		pthread_barrier_wait(&amp;barrier);

		/* The test.  */
		struct timespec before, after, sleeptimeabs;
		int64_t sleepdiff, diffabs;
		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

		/* The relative nanosleep.  Not sure why this is needed, but its presence
		   seems to make it easier to reproduce the problem.  */
		if (do_nanosleep(0, &amp;sleeptime) != 0) {
			perror("clock_nanosleep");
			return 1;
		}

		/* Get the current time.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &amp;before) &lt; 0) {
			perror("clock_gettime[2]");
			return 1;
		}

		/* Compute the absolute sleep time based on the current time.  */
		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
		sleeptimeabs.tv_nsec = nsec % 1000000000;

		/* Sleep for the computed time.  */
		if (do_nanosleep(TIMER_ABSTIME, &amp;sleeptimeabs) != 0) {
			perror("absolute clock_nanosleep");
			return 1;
		}

		/* Get the time after the sleep.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &amp;after) &lt; 0) {
			perror("clock_gettime[3]");
			return 1;
		}

		/* The time after sleep should always be equal to or after the absolute sleep
		   time passed to clock_nanosleep.  */
		sleepdiff = tsdiff(&amp;sleeptimeabs, &amp;after);
		if (sleepdiff &lt; 0) {
			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
			result = 1;

			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
		}

		/* The difference between the timestamps taken before and after the
		   clock_nanosleep call should be equal to or more than the duration of the
		   sleep.  */
		diffabs = tsdiff(&amp;before, &amp;after);
		if (diffabs &lt; sleeptime.tv_nsec) {
			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
			result = 1;
		}

		pthread_cancel(th);

		return result;
	}

Signed-off-by: Stanislaw Gruszka &lt;sgruszka@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
test case in cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&amp;Y) can result
of Y time being smaller than X time.

Reproducer/tester can be found further below, it can be compiled and ran by:

	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
	while ./tst-cpuclock2 ; do : ; done

This reproducer, when running on a buggy kernel, will complain
about "clock_gettime difference too small".

Issue happens because on start in thread_group_cputimer() we initialize
sum_exec_runtime of cputimer with threads runtime not yet accounted and
then add the threads runtime to running cputimer again on scheduler
tick, making it's sum_exec_runtime bigger than actual threads runtime.

KOSAKI Motohiro posted a fix for this problem, but that patch was never
applied: https://lkml.org/lkml/2013/5/26/191 .

This patch takes different approach to cure the problem. It calls
update_curr() when cputimer starts, that assure we will have updated
stats of running threads and on the next schedule tick we will account
only the runtime that elapsed from cputimer start. That also assure we
have consistent state between cpu times of individual threads and cpu
time of the process consisted by those threads.

Full reproducer (tst-cpuclock2.c):

	#define _GNU_SOURCE
	#include &lt;unistd.h&gt;
	#include &lt;sys/syscall.h&gt;
	#include &lt;stdio.h&gt;
	#include &lt;time.h&gt;
	#include &lt;pthread.h&gt;
	#include &lt;stdint.h&gt;
	#include &lt;inttypes.h&gt;

	/* Parameters for the Linux kernel ABI for CPU clocks.  */
	#define CPUCLOCK_SCHED          2
	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
		((~(clockid_t) (pid) &lt;&lt; 3) | (clockid_t) (clock))

	static pthread_barrier_t barrier;

	/* Help advance the clock.  */
	static void *chew_cpu(void *arg)
	{
		pthread_barrier_wait(&amp;barrier);
		while (1) ;

		return NULL;
	}

	/* Don't use the glibc wrapper.  */
	static int do_nanosleep(int flags, const struct timespec *req)
	{
		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
	}

	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
	{
		int64_t before_i = before-&gt;tv_sec * 1000000000ULL + before-&gt;tv_nsec;
		int64_t after_i = after-&gt;tv_sec * 1000000000ULL + after-&gt;tv_nsec;

		return after_i - before_i;
	}

	int main(void)
	{
		int result = 0;
		pthread_t th;

		pthread_barrier_init(&amp;barrier, NULL, 2);

		if (pthread_create(&amp;th, NULL, chew_cpu, NULL) != 0) {
			perror("pthread_create");
			return 1;
		}

		pthread_barrier_wait(&amp;barrier);

		/* The test.  */
		struct timespec before, after, sleeptimeabs;
		int64_t sleepdiff, diffabs;
		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

		/* The relative nanosleep.  Not sure why this is needed, but its presence
		   seems to make it easier to reproduce the problem.  */
		if (do_nanosleep(0, &amp;sleeptime) != 0) {
			perror("clock_nanosleep");
			return 1;
		}

		/* Get the current time.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &amp;before) &lt; 0) {
			perror("clock_gettime[2]");
			return 1;
		}

		/* Compute the absolute sleep time based on the current time.  */
		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
		sleeptimeabs.tv_nsec = nsec % 1000000000;

		/* Sleep for the computed time.  */
		if (do_nanosleep(TIMER_ABSTIME, &amp;sleeptimeabs) != 0) {
			perror("absolute clock_nanosleep");
			return 1;
		}

		/* Get the time after the sleep.  */
		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &amp;after) &lt; 0) {
			perror("clock_gettime[3]");
			return 1;
		}

		/* The time after sleep should always be equal to or after the absolute sleep
		   time passed to clock_nanosleep.  */
		sleepdiff = tsdiff(&amp;sleeptimeabs, &amp;after);
		if (sleepdiff &lt; 0) {
			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
			result = 1;

			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
		}

		/* The difference between the timestamps taken before and after the
		   clock_nanosleep call should be equal to or more than the duration of the
		   sleep.  */
		diffabs = tsdiff(&amp;before, &amp;after);
		if (diffabs &lt; sleeptime.tv_nsec) {
			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
			result = 1;
		}

		pthread_cancel(th);

		return result;
	}

Signed-off-by: Stanislaw Gruszka &lt;sgruszka@redhat.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Frederic Weisbecker &lt;fweisbec@gmail.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/numa: Avoid selecting oneself as swap target</title>
<updated>2014-11-16T09:04:17+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2014-11-10T09:54:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7af683350cb0ddd0e9d3819b4eb7abe9e2d3e709'/>
<id>7af683350cb0ddd0e9d3819b4eb7abe9e2d3e709</id>
<content type='text'>
Because the whole numa task selection stuff runs with preemption
enabled (its long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.

Reported-and-Tested-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Because the whole numa task selection stuff runs with preemption
enabled (its long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.

Reported-and-Tested-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sched: Refactor task_struct to use numa_faults instead of numa_* pointers</title>
<updated>2014-11-04T06:17:57+00:00</updated>
<author>
<name>Iulia Manda</name>
<email>iulia.manda21@gmail.com</email>
</author>
<published>2014-10-31T00:13:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=44dba3d5d6a10685fb15bd1954e62016334825e0'/>
<id>44dba3d5d6a10685fb15bd1954e62016334825e0</id>
<content type='text'>
This patch simplifies task_struct by removing the four numa_* pointers
in the same array and replacing them with the array pointer. By doing this,
on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on
x86_64).

A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.

All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.

Signed-off-by: Iulia Manda &lt;iulia.manda21@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch simplifies task_struct by removing the four numa_* pointers
in the same array and replacing them with the array pointer. By doing this,
on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on
x86_64).

A new parameter is added to the task_faults_idx function so that it can return
an index to the correct offset, corresponding with the old precalculated
pointers.

All of the code in sched/ that depended on task_faults_idx and numa_* was
changed in order to match the new logic.

Signed-off-by: Iulia Manda &lt;iulia.manda21@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: mgorman@suse.de
Cc: dave@stgolabs.net
Cc: riel@redhat.com
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
