<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/kernel, branch v3.12.6</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>sched: Avoid throttle_cfs_rq() racing with period_timer stopping</title>
<updated>2013-12-20T15:49:07+00:00</updated>
<author>
<name>Ben Segall</name>
<email>bsegall@google.com</email>
</author>
<published>2013-10-16T18:16:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e8e21dd5461293632406cc705c9ee85dd0f4156e'/>
<id>e8e21dd5461293632406cc705c9ee85dd0f4156e</id>
<content type='text'>
commit f9f9ffc237dd924f048204e8799da74f9ecf40cf upstream.

throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.

Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.

(Also add some sched_debug lines for cfs_bandwidth.)

Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails, with the full patch it passes.

Signed-off-by: Ben Segall &lt;bsegall@google.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Chris J Arges &lt;chris.j.arges@canonical.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit f9f9ffc237dd924f048204e8799da74f9ecf40cf upstream.

throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.

Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.

(Also add some sched_debug lines for cfs_bandwidth.)

Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails, with the full patch it passes.

Signed-off-by: Ben Segall &lt;bsegall@google.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Chris J Arges &lt;chris.j.arges@canonical.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>futex: fix handling of read-only-mapped hugepages</title>
<updated>2013-12-20T15:48:54+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-12-12T17:38:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=60820b2b4fb34a19075ce3562fe948ee9a5b366f'/>
<id>60820b2b4fb34a19075ce3562fe948ee9a5b366f</id>
<content type='text'>
commit f12d5bfceb7e1f9051563381ec047f7f13956c3c upstream.

The hugepage code had the exact same bug that regular pages had in
commit 7485d0d3758e ("futexes: Remove rw parameter from
get_futex_key()").

The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
regression with read only mappings"), but the transparent hugepage case
(added in a5b338f2b0b1: "thp: update futex compound knowledge") case
remained broken.

Found by Dave Jones and his trinity tool.

Reported-and-tested-by: Dave Jones &lt;davej@fedoraproject.org&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Darren Hart &lt;dvhart@linux.intel.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit f12d5bfceb7e1f9051563381ec047f7f13956c3c upstream.

The hugepage code had the exact same bug that regular pages had in
commit 7485d0d3758e ("futexes: Remove rw parameter from
get_futex_key()").

The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
regression with read only mappings"), but the transparent hugepage case
(added in a5b338f2b0b1: "thp: update futex compound knowledge") case
remained broken.

Found by Dave Jones and his trinity tool.

Reported-and-tested-by: Dave Jones &lt;davej@fedoraproject.org&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Darren Hart &lt;dvhart@linux.intel.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>PCI: Disable Bus Master only on kexec reboot</title>
<updated>2013-12-20T15:48:54+00:00</updated>
<author>
<name>Khalid Aziz</name>
<email>khalid.aziz@oracle.com</email>
</author>
<published>2013-11-27T22:19:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2ab0ff9b179aadf1f30f2dfa9db2b9404324e135'/>
<id>2ab0ff9b179aadf1f30f2dfa9db2b9404324e135</id>
<content type='text'>
commit 4fc9bbf98fd66f879e628d8537ba7c240be2b58e upstream.

Add a flag to tell the PCI subsystem that kernel is shutting down in
preparation to kexec a kernel.  Add code in PCI subsystem to use this flag
to clear Bus Master bit on PCI devices only in case of kexec reboot.

This fixes a power-off problem on Acer Aspire V5-573G and likely other
machines and avoids any other issues caused by clearing Bus Master bit on
PCI devices in normal shutdown path.  The problem was introduced by
b566a22c2332 ("PCI: disable Bus Master on PCI device shutdown").

This patch is based on discussion at
http://marc.info/?l=linux-pci&amp;m=138425645204355&amp;w=2

Link: https://bugzilla.kernel.org/show_bug.cgi?id=63861
Reported-by: Chang Liu &lt;cl91tp@gmail.com&gt;
Signed-off-by: Khalid Aziz &lt;khalid.aziz@oracle.com&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Acked-by: Konstantin Khlebnikov &lt;koct9i@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 4fc9bbf98fd66f879e628d8537ba7c240be2b58e upstream.

Add a flag to tell the PCI subsystem that kernel is shutting down in
preparation to kexec a kernel.  Add code in PCI subsystem to use this flag
to clear Bus Master bit on PCI devices only in case of kexec reboot.

This fixes a power-off problem on Acer Aspire V5-573G and likely other
machines and avoids any other issues caused by clearing Bus Master bit on
PCI devices in normal shutdown path.  The problem was introduced by
b566a22c2332 ("PCI: disable Bus Master on PCI device shutdown").

This patch is based on discussion at
http://marc.info/?l=linux-pci&amp;m=138425645204355&amp;w=2

Link: https://bugzilla.kernel.org/show_bug.cgi?id=63861
Reported-by: Chang Liu &lt;cl91tp@gmail.com&gt;
Signed-off-by: Khalid Aziz &lt;khalid.aziz@oracle.com&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Acked-by: Konstantin Khlebnikov &lt;koct9i@gmail.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>irq: Enable all irqs unconditionally in irq_resume</title>
<updated>2013-12-12T06:37:54+00:00</updated>
<author>
<name>Laxman Dewangan</name>
<email>ldewangan@nvidia.com</email>
</author>
<published>2013-11-25T14:09:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=51bb75cbab7e08ea1ee59c0cfcfa68d1d80dfea4'/>
<id>51bb75cbab7e08ea1ee59c0cfcfa68d1d80dfea4</id>
<content type='text'>
commit ac01810c9d2814238f08a227062e66a35a0e1ea2 upstream.

When the system enters suspend, it disables all interrupts in
suspend_device_irqs(), including the interrupts marked EARLY_RESUME.

On the resume side things are different. The EARLY_RESUME interrupts
are reenabled in sys_core_ops-&gt;resume and the non EARLY_RESUME
interrupts are reenabled in the normal system resume path.

When suspend_noirq() failed or suspend is aborted for any other
reason, we might omit the resume side call to sys_core_ops-&gt;resume()
and therefor the interrupts marked EARLY_RESUME are not reenabled and
stay disabled forever.

To solve this, enable all irqs unconditionally in irq_resume()
regardless whether interrupts marked EARLY_RESUMEhave been already
enabled or not.

This might try to reenable already enabled interrupts in the non
failure case, but the only affected platform is XEN and it has been
confirmed that it does not cause any side effects.

[ tglx: Massaged changelog. ]

Signed-off-by: Laxman Dewangan &lt;ldewangan@nvidia.com&gt;
Acked-by-and-tested-by: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Acked-by: Heiko Stuebner &lt;heiko@sntech.de&gt;
Reviewed-by: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: &lt;ian.campbell@citrix.com&gt;
Cc: &lt;rjw@rjwysocki.net&gt;
Cc: &lt;len.brown@intel.com&gt;
Cc: &lt;gregkh@linuxfoundation.org&gt;
Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit ac01810c9d2814238f08a227062e66a35a0e1ea2 upstream.

When the system enters suspend, it disables all interrupts in
suspend_device_irqs(), including the interrupts marked EARLY_RESUME.

On the resume side things are different. The EARLY_RESUME interrupts
are reenabled in sys_core_ops-&gt;resume and the non EARLY_RESUME
interrupts are reenabled in the normal system resume path.

When suspend_noirq() failed or suspend is aborted for any other
reason, we might omit the resume side call to sys_core_ops-&gt;resume()
and therefor the interrupts marked EARLY_RESUME are not reenabled and
stay disabled forever.

To solve this, enable all irqs unconditionally in irq_resume()
regardless whether interrupts marked EARLY_RESUMEhave been already
enabled or not.

This might try to reenable already enabled interrupts in the non
failure case, but the only affected platform is XEN and it has been
confirmed that it does not cause any side effects.

[ tglx: Massaged changelog. ]

Signed-off-by: Laxman Dewangan &lt;ldewangan@nvidia.com&gt;
Acked-by-and-tested-by: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Acked-by: Heiko Stuebner &lt;heiko@sntech.de&gt;
Reviewed-by: Pavel Machek &lt;pavel@ucw.cz&gt;
Cc: &lt;ian.campbell@citrix.com&gt;
Cc: &lt;rjw@rjwysocki.net&gt;
Cc: &lt;len.brown@intel.com&gt;
Cc: &lt;gregkh@linuxfoundation.org&gt;
Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>time: Fix 1ns/tick drift w/ GENERIC_TIME_VSYSCALL_OLD</title>
<updated>2013-12-12T06:37:54+00:00</updated>
<author>
<name>Martin Schwidefsky</name>
<email>schwidefsky@de.ibm.com</email>
</author>
<published>2013-11-22T19:44:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=90a01bfc77a797b9fb59d0f88224a9de02953700'/>
<id>90a01bfc77a797b9fb59d0f88224a9de02953700</id>
<content type='text'>
commit 4be77398ac9d948773116b6be4a3c91b3d6ea18c upstream.

Since commit 1e75fa8be9f (time: Condense timekeeper.xtime
into xtime_sec - merged in v3.6), there has been an problem
with the error accounting in the timekeeping code, such that
when truncating to nanoseconds, we round up to the next nsec,
but the balancing adjustment to the ntp_error value was dropped.

This causes 1ns per tick drift forward of the clock.

In 3.7, this logic was isolated to only GENERIC_TIME_VSYSCALL_OLD
architectures (s390, ia64, powerpc).

The fix is simply to balance the accounting and to subtract the
added nanosecond from ntp_error. This allows the internal long-term
clock steering to keep the clock accurate.

While this fix removes the regression added in 1e75fa8be9f, the
ideal solution is to move away from GENERIC_TIME_VSYSCALL_OLD
and use the new VSYSCALL method, which avoids entirely the
nanosecond granular rounding, and the resulting short-term clock
adjustment oscillation needed to keep long term accurate time.

[ jstultz: Many thanks to Martin for his efforts identifying this
  	   subtle bug, and providing the fix. ]

Originally-from: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Paul Turner &lt;pjt@google.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Richard Cochran &lt;richardcochran@gmail.com&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Fenghua Yu &lt;fenghua.yu@intel.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: http://lkml.kernel.org/r/1385149491-20307-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 4be77398ac9d948773116b6be4a3c91b3d6ea18c upstream.

Since commit 1e75fa8be9f (time: Condense timekeeper.xtime
into xtime_sec - merged in v3.6), there has been an problem
with the error accounting in the timekeeping code, such that
when truncating to nanoseconds, we round up to the next nsec,
but the balancing adjustment to the ntp_error value was dropped.

This causes 1ns per tick drift forward of the clock.

In 3.7, this logic was isolated to only GENERIC_TIME_VSYSCALL_OLD
architectures (s390, ia64, powerpc).

The fix is simply to balance the accounting and to subtract the
added nanosecond from ntp_error. This allows the internal long-term
clock steering to keep the clock accurate.

While this fix removes the regression added in 1e75fa8be9f, the
ideal solution is to move away from GENERIC_TIME_VSYSCALL_OLD
and use the new VSYSCALL method, which avoids entirely the
nanosecond granular rounding, and the resulting short-term clock
adjustment oscillation needed to keep long term accurate time.

[ jstultz: Many thanks to Martin for his efforts identifying this
  	   subtle bug, and providing the fix. ]

Originally-from: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Andy Lutomirski &lt;luto@amacapital.net&gt;
Cc: Paul Turner &lt;pjt@google.com&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Richard Cochran &lt;richardcochran@gmail.com&gt;
Cc: Prarit Bhargava &lt;prarit@redhat.com&gt;
Cc: Fenghua Yu &lt;fenghua.yu@intel.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Link: http://lkml.kernel.org/r/1385149491-20307-1-git-send-email-john.stultz@linaro.org
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>ntp: Make periodic RTC update more reliable</title>
<updated>2013-12-08T15:29:16+00:00</updated>
<author>
<name>Miroslav Lichvar</name>
<email>mlichvar@redhat.com</email>
</author>
<published>2013-08-01T17:31:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=27135f5f3ba151eefff1305488dc98cbbc216710'/>
<id>27135f5f3ba151eefff1305488dc98cbbc216710</id>
<content type='text'>
commit a97ad0c4b447a132a322cedc3a5f7fa4cab4b304 upstream.

The current code requires that the scheduled update of the RTC happens
in the closest tick to the half of the second. This seems to be
difficult to achieve reliably. The scheduled work may be missing the
target time by a tick or two and be constantly rescheduled every second.

Relax the limit to 10 ticks. As a typical RTC drifts in the 11-minute
update interval by several milliseconds, this shouldn't affect the
overall accuracy of the RTC much.

Signed-off-by: Miroslav Lichvar &lt;mlichvar@redhat.com&gt;
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: Josh Boyer &lt;jwboyer@fedoraproject.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit a97ad0c4b447a132a322cedc3a5f7fa4cab4b304 upstream.

The current code requires that the scheduled update of the RTC happens
in the closest tick to the half of the second. This seems to be
difficult to achieve reliably. The scheduled work may be missing the
target time by a tick or two and be constantly rescheduled every second.

Relax the limit to 10 ticks. As a typical RTC drifts in the 11-minute
update interval by several milliseconds, this shouldn't affect the
overall accuracy of the RTC much.

Signed-off-by: Miroslav Lichvar &lt;mlichvar@redhat.com&gt;
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: Josh Boyer &lt;jwboyer@fedoraproject.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>cpuset: Fix memory allocator deadlock</title>
<updated>2013-12-04T19:05:55+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2013-11-26T14:03:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=764d66de37288d6baab42e07060de4f5c97089a2'/>
<id>764d66de37288d6baab42e07060de4f5c97089a2</id>
<content type='text'>
commit 0fc0287c9ed1ffd3706f8b4d9b314aa102ef1245 upstream.

Juri hit the below lockdep report:

[    4.303391] ======================================================
[    4.303392] [ INFO: SOFTIRQ-safe -&gt; SOFTIRQ-unsafe lock order detected ]
[    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
[    4.303395] ------------------------------------------------------
[    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[    4.303399]  (&amp;p-&gt;mems_allowed_seq){+.+...}, at: [&lt;ffffffff8114e63c&gt;] new_slab+0x6c/0x290
[    4.303417]
[    4.303417] and this task is already holding:
[    4.303418]  (&amp;(&amp;q-&gt;__queue_lock)-&gt;rlock){..-...}, at: [&lt;ffffffff812d2dfb&gt;] blk_execute_rq_nowait+0x5b/0x100
[    4.303431] which would create a new lock dependency:
[    4.303432]  (&amp;(&amp;q-&gt;__queue_lock)-&gt;rlock){..-...} -&gt; (&amp;p-&gt;mems_allowed_seq){+.+...}
[    4.303436]

[    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[    4.303918] -&gt; (&amp;p-&gt;mems_allowed_seq){+.+...} ops: 2762 {
[    4.303922]    HARDIRQ-ON-W at:
[    4.303923]                     [&lt;ffffffff8108ab9a&gt;] __lock_acquire+0x65a/0x1ff0
[    4.303926]                     [&lt;ffffffff8108cbe3&gt;] lock_acquire+0x93/0x140
[    4.303929]                     [&lt;ffffffff81063dd6&gt;] kthreadd+0x86/0x180
[    4.303931]                     [&lt;ffffffff816ded6c&gt;] ret_from_fork+0x7c/0xb0
[    4.303933]    SOFTIRQ-ON-W at:
[    4.303933]                     [&lt;ffffffff8108abcc&gt;] __lock_acquire+0x68c/0x1ff0
[    4.303935]                     [&lt;ffffffff8108cbe3&gt;] lock_acquire+0x93/0x140
[    4.303940]                     [&lt;ffffffff81063dd6&gt;] kthreadd+0x86/0x180
[    4.303955]                     [&lt;ffffffff816ded6c&gt;] ret_from_fork+0x7c/0xb0
[    4.303959]    INITIAL USE at:
[    4.303960]                    [&lt;ffffffff8108a884&gt;] __lock_acquire+0x344/0x1ff0
[    4.303963]                    [&lt;ffffffff8108cbe3&gt;] lock_acquire+0x93/0x140
[    4.303966]                    [&lt;ffffffff81063dd6&gt;] kthreadd+0x86/0x180
[    4.303969]                    [&lt;ffffffff816ded6c&gt;] ret_from_fork+0x7c/0xb0
[    4.303972]  }

Which reports that we take mems_allowed_seq with interrupts enabled. A
little digging found that this can only be from
cpuset_change_task_nodemask().

This is an actual deadlock because an interrupt doing an allocation will
hit get_mems_allowed()-&gt;...-&gt;__read_seqcount_begin(), which will spin
forever waiting for the write side to complete.

Cc: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Juri Lelli &lt;juri.lelli@gmail.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Tested-by: Juri Lelli &lt;juri.lelli@gmail.com&gt;
Acked-by: Li Zefan &lt;lizefan@huawei.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 0fc0287c9ed1ffd3706f8b4d9b314aa102ef1245 upstream.

Juri hit the below lockdep report:

[    4.303391] ======================================================
[    4.303392] [ INFO: SOFTIRQ-safe -&gt; SOFTIRQ-unsafe lock order detected ]
[    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
[    4.303395] ------------------------------------------------------
[    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[    4.303399]  (&amp;p-&gt;mems_allowed_seq){+.+...}, at: [&lt;ffffffff8114e63c&gt;] new_slab+0x6c/0x290
[    4.303417]
[    4.303417] and this task is already holding:
[    4.303418]  (&amp;(&amp;q-&gt;__queue_lock)-&gt;rlock){..-...}, at: [&lt;ffffffff812d2dfb&gt;] blk_execute_rq_nowait+0x5b/0x100
[    4.303431] which would create a new lock dependency:
[    4.303432]  (&amp;(&amp;q-&gt;__queue_lock)-&gt;rlock){..-...} -&gt; (&amp;p-&gt;mems_allowed_seq){+.+...}
[    4.303436]

[    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[    4.303918] -&gt; (&amp;p-&gt;mems_allowed_seq){+.+...} ops: 2762 {
[    4.303922]    HARDIRQ-ON-W at:
[    4.303923]                     [&lt;ffffffff8108ab9a&gt;] __lock_acquire+0x65a/0x1ff0
[    4.303926]                     [&lt;ffffffff8108cbe3&gt;] lock_acquire+0x93/0x140
[    4.303929]                     [&lt;ffffffff81063dd6&gt;] kthreadd+0x86/0x180
[    4.303931]                     [&lt;ffffffff816ded6c&gt;] ret_from_fork+0x7c/0xb0
[    4.303933]    SOFTIRQ-ON-W at:
[    4.303933]                     [&lt;ffffffff8108abcc&gt;] __lock_acquire+0x68c/0x1ff0
[    4.303935]                     [&lt;ffffffff8108cbe3&gt;] lock_acquire+0x93/0x140
[    4.303940]                     [&lt;ffffffff81063dd6&gt;] kthreadd+0x86/0x180
[    4.303955]                     [&lt;ffffffff816ded6c&gt;] ret_from_fork+0x7c/0xb0
[    4.303959]    INITIAL USE at:
[    4.303960]                    [&lt;ffffffff8108a884&gt;] __lock_acquire+0x344/0x1ff0
[    4.303963]                    [&lt;ffffffff8108cbe3&gt;] lock_acquire+0x93/0x140
[    4.303966]                    [&lt;ffffffff81063dd6&gt;] kthreadd+0x86/0x180
[    4.303969]                    [&lt;ffffffff816ded6c&gt;] ret_from_fork+0x7c/0xb0
[    4.303972]  }

Which reports that we take mems_allowed_seq with interrupts enabled. A
little digging found that this can only be from
cpuset_change_task_nodemask().

This is an actual deadlock because an interrupt doing an allocation will
hit get_mems_allowed()-&gt;...-&gt;__read_seqcount_begin(), which will spin
forever waiting for the write side to complete.

Cc: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Reported-by: Juri Lelli &lt;juri.lelli@gmail.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Tested-by: Juri Lelli &lt;juri.lelli@gmail.com&gt;
Acked-by: Li Zefan &lt;lizefan@huawei.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup: fix cgroup_subsys_state leak for seq_files</title>
<updated>2013-12-04T19:05:55+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-11-27T23:16:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e6af24fef5c9ed17a25f062b3c22c3df44904dc9'/>
<id>e6af24fef5c9ed17a25f062b3c22c3df44904dc9</id>
<content type='text'>
commit e605b36575e896edd8161534550c9ea021b03bc0 upstream.

If a cgroup file implements either read_map() or read_seq_string(),
such file is served using seq_file by overriding file-&gt;f_op to
cgroup_seqfile_operations, which also overrides the release method to
single_release() from cgroup_file_release().

Because cgroup_file_open() didn't use to acquire any resources, this
used to be fine, but since f7d58818ba42 ("cgroup: pin
cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
pins the css (cgroup_subsys_state) which is put by
cgroup_file_release().  The patch forgot to update the release path
for seq_files and each open/release cycle leaks a css reference.

Fix it by updating cgroup_file_release() to also handle seq_files and
using it for seq_file release path too.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit e605b36575e896edd8161534550c9ea021b03bc0 upstream.

If a cgroup file implements either read_map() or read_seq_string(),
such file is served using seq_file by overriding file-&gt;f_op to
cgroup_seqfile_operations, which also overrides the release method to
single_release() from cgroup_file_release().

Because cgroup_file_open() didn't use to acquire any resources, this
used to be fine, but since f7d58818ba42 ("cgroup: pin
cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
pins the css (cgroup_subsys_state) which is put by
cgroup_file_release().  The patch forgot to update the release path
for seq_files and each open/release cycle leaks a css reference.

Fix it by updating cgroup_file_release() to also handle seq_files and
using it for seq_file release path too.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup: use a dedicated workqueue for cgroup destruction</title>
<updated>2013-12-04T19:05:55+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-11-22T22:14:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=65d6ec10c7cf2575de2aa9159f8cf43cbc1074fe'/>
<id>65d6ec10c7cf2575de2aa9159f8cf43cbc1074fe</id>
<content type='text'>
commit e5fca243abae1445afbfceebda5f08462ef869d3 upstream.

Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.

As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.

Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.

This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.

Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Hugh Dickins &lt;hughd@google.com&gt;
Reported-by: Shawn Bohrer &lt;shawn.bohrer@gmail.com&gt;
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit e5fca243abae1445afbfceebda5f08462ef869d3 upstream.

Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.

As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.

Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.

This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.

Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Hugh Dickins &lt;hughd@google.com&gt;
Reported-by: Shawn Bohrer &lt;shawn.bohrer@gmail.com&gt;
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>workqueue: fix ordered workqueues in NUMA setups</title>
<updated>2013-12-04T19:05:55+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2013-09-05T16:30:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a432a4023f22f8c0aa9fa7262b124354351e6390'/>
<id>a432a4023f22f8c0aa9fa7262b124354351e6390</id>
<content type='text'>
commit 8a2b75384444488fc4f2cbb9f0921b6a0794838f upstream.

An ordered workqueue implements execution ordering by using single
pool_workqueue with max_active == 1.  On a given pool_workqueue, work
items are processed in FIFO order and limiting max_active to 1
enforces the queued work items to be processed one by one.

Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for
unbound workqueues") accidentally broke this guarantee by applying
NUMA affinity to ordered workqueues too.  On NUMA setups, an ordered
workqueue would end up with separate pool_workqueues for different
nodes.  Each pool_workqueue still limits max_active to 1 but multiple
work items may be executed concurrently and out of order depending on
which node they are queued to.

Fix it by using dedicated ordered_wq_attrs[] when creating ordered
workqueues.  The new attrs match the unbound ones except that no_numa
is always set thus forcing all NUMA nodes to share the default
pool_workqueue.

While at it, add sanity check in workqueue creation path which
verifies that an ordered workqueues has only the default
pool_workqueue.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Libin &lt;huawei.libin@huawei.com&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 8a2b75384444488fc4f2cbb9f0921b6a0794838f upstream.

An ordered workqueue implements execution ordering by using single
pool_workqueue with max_active == 1.  On a given pool_workqueue, work
items are processed in FIFO order and limiting max_active to 1
enforces the queued work items to be processed one by one.

Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for
unbound workqueues") accidentally broke this guarantee by applying
NUMA affinity to ordered workqueues too.  On NUMA setups, an ordered
workqueue would end up with separate pool_workqueues for different
nodes.  Each pool_workqueue still limits max_active to 1 but multiple
work items may be executed concurrently and out of order depending on
which node they are queued to.

Fix it by using dedicated ordered_wq_attrs[] when creating ordered
workqueues.  The new attrs match the unbound ones except that no_numa
is always set thus forcing all NUMA nodes to share the default
pool_workqueue.

While at it, add sanity check in workqueue creation path which
verifies that an ordered workqueues has only the default
pool_workqueue.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Reported-by: Libin &lt;huawei.libin@huawei.com&gt;
Cc: Lai Jiangshan &lt;laijs@cn.fujitsu.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
</feed>
