<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/mm, branch v3.10</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge branch 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux</title>
<updated>2013-06-18T16:27:47+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-06-18T16:27:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4c3577c58f7efcddeaa269e7ddbe75e8acfbb7de'/>
<id>4c3577c58f7efcddeaa269e7ddbe75e8acfbb7de</id>
<content type='text'>
Pull SLAB fix from Pekka Enberg:
 "A slab regression fix by Sasha Levin"

* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
  slab: prevent warnings when allocating with __GFP_NOWARN
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull SLAB fix from Pekka Enberg:
 "A slab regression fix by Sasha Levin"

* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
  slab: prevent warnings when allocating with __GFP_NOWARN
</pre>
</div>
</content>
</entry>
<entry>
<title>slab: prevent warnings when allocating with __GFP_NOWARN</title>
<updated>2013-06-13T07:01:58+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sasha.levin@oracle.com</email>
</author>
<published>2013-06-10T19:18:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=907985f48bc60818e291c631249f9bc84c83a06f'/>
<id>907985f48bc60818e291c631249f9bc84c83a06f</id>
<content type='text'>
Sasha Levin noticed that the warning introduced by commit 6286ae9
("slab: Return NULL for oversized allocations) is being triggered:

  WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
  can: request_module (can-proto-4) failed.
  mpoa: proc_mpc_write: could not parse ''
  Modules linked in:
  CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W    3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
   0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
   ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
   0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
  Call Trace:
   [&lt;ffffffff83ff4041&gt;] dump_stack+0x4e/0x82
   [&lt;ffffffff8111fe12&gt;] warn_slowpath_common+0x82/0xb0
   [&lt;ffffffff8111fe55&gt;] warn_slowpath_null+0x15/0x20
   [&lt;ffffffff81243dcf&gt;] kmalloc_slab+0x2f/0xb0
   [&lt;ffffffff81278d54&gt;] __kmalloc+0x24/0x4b0
   [&lt;ffffffff8196ffe3&gt;] ? security_capable+0x13/0x20
   [&lt;ffffffff812a26b7&gt;] ? pipe_fcntl+0x107/0x210
   [&lt;ffffffff812a26b7&gt;] pipe_fcntl+0x107/0x210
   [&lt;ffffffff812b7ea0&gt;] ? fget_raw_light+0x130/0x3f0
   [&lt;ffffffff812aa5fb&gt;] SyS_fcntl+0x60b/0x6a0
   [&lt;ffffffff8403ca98&gt;] tracesys+0xe1/0xe6

Andrew Morton writes:

  __GFP_NOWARN is frequently used by kernel code to probe for "how big
  an allocation can I get".  That's a bit lame, but it's used on slow
  paths and is pretty simple.

However, SLAB would still spew a warning when a big allocation happens
if the __GFP_NOWARN flag is _not_ set to expose kernel bugs.

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
[ penberg@kernel.org: improve changelog ]
Signed-off-by: Pekka Enberg &lt;penberg@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Sasha Levin noticed that the warning introduced by commit 6286ae9
("slab: Return NULL for oversized allocations) is being triggered:

  WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
  can: request_module (can-proto-4) failed.
  mpoa: proc_mpc_write: could not parse ''
  Modules linked in:
  CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W    3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
   0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
   ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
   0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
  Call Trace:
   [&lt;ffffffff83ff4041&gt;] dump_stack+0x4e/0x82
   [&lt;ffffffff8111fe12&gt;] warn_slowpath_common+0x82/0xb0
   [&lt;ffffffff8111fe55&gt;] warn_slowpath_null+0x15/0x20
   [&lt;ffffffff81243dcf&gt;] kmalloc_slab+0x2f/0xb0
   [&lt;ffffffff81278d54&gt;] __kmalloc+0x24/0x4b0
   [&lt;ffffffff8196ffe3&gt;] ? security_capable+0x13/0x20
   [&lt;ffffffff812a26b7&gt;] ? pipe_fcntl+0x107/0x210
   [&lt;ffffffff812a26b7&gt;] pipe_fcntl+0x107/0x210
   [&lt;ffffffff812b7ea0&gt;] ? fget_raw_light+0x130/0x3f0
   [&lt;ffffffff812aa5fb&gt;] SyS_fcntl+0x60b/0x6a0
   [&lt;ffffffff8403ca98&gt;] tracesys+0xe1/0xe6

Andrew Morton writes:

  __GFP_NOWARN is frequently used by kernel code to probe for "how big
  an allocation can I get".  That's a bit lame, but it's used on slow
  paths and is pretty simple.

However, SLAB would still spew a warning when a big allocation happens
if the __GFP_NOWARN flag is _not_ set to expose kernel bugs.

Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
[ penberg@kernel.org: improve changelog ]
Signed-off-by: Pekka Enberg &lt;penberg@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: memcontrol: fix lockless reclaim hierarchy iterator</title>
<updated>2013-06-12T23:29:46+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2013-06-12T21:05:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=89dc991f0f5272c307c746fdd57d0bff382b1ba2'/>
<id>89dc991f0f5272c307c746fdd57d0bff382b1ba2</id>
<content type='text'>
The lockless reclaim hierarchy iterator currently has a misplaced
barrier that can lead to use-after-free crashes.

The reclaim hierarchy iterator consist of a sequence count and a
position pointer that are read and written locklessly, with memory
barriers enforcing ordering.

The write side sets the position pointer first, then updates the
sequence count to "publish" the new position.  Likewise, the read side
must read the sequence count first, then the position.  If the sequence
count is up to date, it's guaranteed that the position is up to date as
well:

  writer:                         reader:
  iter-&gt;position = position       if iter-&gt;sequence == expected:
  smp_wmb()                           smp_rmb()
  iter-&gt;sequence = sequence           position = iter-&gt;position

However, the read side barrier is currently misplaced, which can lead to
dereferencing stale position pointers that no longer point to valid
memory.  Fix this.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Glauber Costa &lt;glommer@parallels.com&gt;
Cc: &lt;stable@kernel.org&gt;		[3.10+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The lockless reclaim hierarchy iterator currently has a misplaced
barrier that can lead to use-after-free crashes.

The reclaim hierarchy iterator consist of a sequence count and a
position pointer that are read and written locklessly, with memory
barriers enforcing ordering.

The write side sets the position pointer first, then updates the
sequence count to "publish" the new position.  Likewise, the read side
must read the sequence count first, then the position.  If the sequence
count is up to date, it's guaranteed that the position is up to date as
well:

  writer:                         reader:
  iter-&gt;position = position       if iter-&gt;sequence == expected:
  smp_wmb()                           smp_rmb()
  iter-&gt;sequence = sequence           position = iter-&gt;position

However, the read side barrier is currently misplaced, which can lead to
dereferencing stale position pointers that no longer point to valid
memory.  Fix this.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-by: Tejun Heo &lt;tj@kernel.org&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Glauber Costa &lt;glommer@parallels.com&gt;
Cc: &lt;stable@kernel.org&gt;		[3.10+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>frontswap: fix incorrect zeroing and allocation size for frontswap_map</title>
<updated>2013-06-12T23:29:46+00:00</updated>
<author>
<name>Akinobu Mita</name>
<email>akinobu.mita@gmail.com</email>
</author>
<published>2013-06-12T21:05:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7b57976da48e60b66fdbb9e97f5711b5382a49d7'/>
<id>7b57976da48e60b66fdbb9e97f5711b5382a49d7</id>
<content type='text'>
The bitmap accessed by bitops must have enough size to hold the required
numbers of bits rounded up to a multiple of BITS_PER_LONG.  And the
bitmap must not be zeroed by memset() if the number of bits cleared is
not a multiple of BITS_PER_LONG.

This fixes incorrect zeroing and allocation size for frontswap_map.  The
incorrect zeroing part doesn't cause any problem because frontswap_map
is freed just after zeroing.  But the wrongly calculated allocation size
may cause the problem.

For 32bit systems, the allocation size of frontswap_map is about twice
as large as required size.  For 64bit systems, the allocation size is
smaller than requeired if the number of bits is not a multiple of
BITS_PER_LONG.

Signed-off-by: Akinobu Mita &lt;akinobu.mita@gmail.com&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The bitmap accessed by bitops must have enough size to hold the required
numbers of bits rounded up to a multiple of BITS_PER_LONG.  And the
bitmap must not be zeroed by memset() if the number of bits cleared is
not a multiple of BITS_PER_LONG.

This fixes incorrect zeroing and allocation size for frontswap_map.  The
incorrect zeroing part doesn't cause any problem because frontswap_map
is freed just after zeroing.  But the wrongly calculated allocation size
may cause the problem.

For 32bit systems, the allocation size of frontswap_map is about twice
as large as required size.  For 64bit systems, the allocation size is
smaller than requeired if the number of bits is not a multiple of
BITS_PER_LONG.

Signed-off-by: Akinobu Mita &lt;akinobu.mita@gmail.com&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: migration: add migrate_entry_wait_huge()</title>
<updated>2013-06-12T23:29:46+00:00</updated>
<author>
<name>Naoya Horiguchi</name>
<email>n-horiguchi@ah.jp.nec.com</email>
</author>
<published>2013-06-12T21:05:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=30dad30922ccc733cfdbfe232090cf674dc374dc'/>
<id>30dad30922ccc733cfdbfe232090cf674dc374dc</id>
<content type='text'>
When we have a page fault for the address which is backed by a hugepage
under migration, the kernel can't wait correctly and do busy looping on
hugepage fault until the migration finishes.  As a result, users who try
to kick hugepage migration (via soft offlining, for example) occasionally
experience long delay or soft lockup.

This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for hugepage.  This patch introduces
migration_entry_wait_huge() to solve this.

Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Wanpeng Li &lt;liwanp@linux.vnet.ibm.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[2.6.35+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When we have a page fault for the address which is backed by a hugepage
under migration, the kernel can't wait correctly and do busy looping on
hugepage fault until the migration finishes.  As a result, users who try
to kick hugepage migration (via soft offlining, for example) occasionally
experience long delay or soft lockup.

This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for hugepage.  This patch introduces
migration_entry_wait_huge() to solve this.

Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Wanpeng Li &lt;liwanp@linux.vnet.ibm.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Andi Kleen &lt;andi@firstfloor.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[2.6.35+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/page_alloc.c: fix watermark check in __zone_watermark_ok()</title>
<updated>2013-06-12T23:29:46+00:00</updated>
<author>
<name>Tomasz Stanislawski</name>
<email>t.stanislaws@samsung.com</email>
</author>
<published>2013-06-12T21:05:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=026b08147923142e925a7d0aaa39038055ae0156'/>
<id>026b08147923142e925a7d0aaa39038055ae0156</id>
<content type='text'>
The watermark check consists of two sub-checks.  The first one is:

	if (free_pages &lt;= min + lowmem_reserve)
		return false;

The check assures that there is minimal amount of RAM in the zone.  If
CMA is used then the free_pages is reduced by the number of free pages
in CMA prior to the over-mentioned check.

	if (!(alloc_flags &amp; ALLOC_CMA))
		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

This prevents the zone from being drained from pages available for
non-movable allocations.

The second check prevents the zone from getting too fragmented.

	for (o = 0; o &lt; order; o++) {
		free_pages -= z-&gt;free_area[o].nr_free &lt;&lt; o;
		min &gt;&gt;= 1;
		if (free_pages &lt;= min)
			return false;
	}

The field z-&gt;free_area[o].nr_free is equal to the number of free pages
including free CMA pages.  Therefore the CMA pages are subtracted twice.
This may cause a false positive fail of __zone_watermark_ok() if the CMA
area gets strongly fragmented.  In such a case there are many 0-order
free pages located in CMA.  Those pages are subtracted twice therefore
they will quickly drain free_pages during the check against
fragmentation.  The test fails even though there are many free non-cma
pages in the zone.

This patch fixes this issue by subtracting CMA pages only for a purpose of
(free_pages &lt;= min + lowmem_reserve) check.

Laura said:

  We were observing allocation failures of higher order pages (order 5 =
  128K typically) under tight memory conditions resulting in driver
  failure.  The output from the page allocation failure showed plenty of
  free pages of the appropriate order/type/zone and mostly CMA pages in
  the lower orders.

  For full disclosure, we still observed some page allocation failures
  even after applying the patch but the number was drastically reduced and
  those failures were attributed to fragmentation/other system issues.

Signed-off-by: Tomasz Stanislawski &lt;t.stanislaws@samsung.com&gt;
Signed-off-by: Kyungmin Park &lt;kyungmin.park@samsung.com&gt;
Tested-by: Laura Abbott &lt;lauraa@codeaurora.org&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Tested-by: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[3.7+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The watermark check consists of two sub-checks.  The first one is:

	if (free_pages &lt;= min + lowmem_reserve)
		return false;

The check assures that there is minimal amount of RAM in the zone.  If
CMA is used then the free_pages is reduced by the number of free pages
in CMA prior to the over-mentioned check.

	if (!(alloc_flags &amp; ALLOC_CMA))
		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

This prevents the zone from being drained from pages available for
non-movable allocations.

The second check prevents the zone from getting too fragmented.

	for (o = 0; o &lt; order; o++) {
		free_pages -= z-&gt;free_area[o].nr_free &lt;&lt; o;
		min &gt;&gt;= 1;
		if (free_pages &lt;= min)
			return false;
	}

The field z-&gt;free_area[o].nr_free is equal to the number of free pages
including free CMA pages.  Therefore the CMA pages are subtracted twice.
This may cause a false positive fail of __zone_watermark_ok() if the CMA
area gets strongly fragmented.  In such a case there are many 0-order
free pages located in CMA.  Those pages are subtracted twice therefore
they will quickly drain free_pages during the check against
fragmentation.  The test fails even though there are many free non-cma
pages in the zone.

This patch fixes this issue by subtracting CMA pages only for a purpose of
(free_pages &lt;= min + lowmem_reserve) check.

Laura said:

  We were observing allocation failures of higher order pages (order 5 =
  128K typically) under tight memory conditions resulting in driver
  failure.  The output from the page allocation failure showed plenty of
  free pages of the appropriate order/type/zone and mostly CMA pages in
  the lower orders.

  For full disclosure, we still observed some page allocation failures
  even after applying the patch but the number was drastically reduced and
  those failures were attributed to fragmentation/other system issues.

Signed-off-by: Tomasz Stanislawski &lt;t.stanislaws@samsung.com&gt;
Signed-off-by: Kyungmin Park &lt;kyungmin.park@samsung.com&gt;
Tested-by: Laura Abbott &lt;lauraa@codeaurora.org&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Tested-by: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[3.7+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion</title>
<updated>2013-06-12T23:29:45+00:00</updated>
<author>
<name>Rafael Aquini</name>
<email>aquini@redhat.com</email>
</author>
<published>2013-06-12T21:04:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cbab0e4eec299e9059199ebe6daf48730be46d2b'/>
<id>cbab0e4eec299e9059199ebe6daf48730be46d2b</id>
<content type='text'>
read_swap_cache_async() can race against get_swap_page(), and stumble
across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
into the swapcache yet.

This transient swap_map state is expected to be transitory, but the
actual placement of discard at scan_swap_map() inserts a wait for I/O
completion thus making the thread at read_swap_cache_async() to loop
around its -EEXIST case, while the other end at get_swap_page() is
scheduled away at scan_swap_map().  This can leave the system deadlocked
if the I/O completion happens to be waiting on the CPU waitqueue where
read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.

This patch introduces a cond_resched() call to make the aforementioned
read_swap_cache_async() busy loop condition to bail out when necessary,
thus avoiding the subtle race window.

Signed-off-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
read_swap_cache_async() can race against get_swap_page(), and stumble
across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
into the swapcache yet.

This transient swap_map state is expected to be transitory, but the
actual placement of discard at scan_swap_map() inserts a wait for I/O
completion thus making the thread at read_swap_cache_async() to loop
around its -EEXIST case, while the other end at get_swap_page() is
scheduled away at scan_swap_map().  This can leave the system deadlocked
if the I/O completion happens to be waiting on the CPU waitqueue where
read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.

This patch introduces a cond_resched() call to make the aforementioned
read_swap_cache_async() busy loop condition to bail out when necessary,
thus avoiding the subtle race window.

Signed-off-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>memcg: don't initialize kmem-cache destroying work for root caches</title>
<updated>2013-06-12T23:29:45+00:00</updated>
<author>
<name>Andrey Vagin</name>
<email>avagin@openvz.org</email>
</author>
<published>2013-06-12T21:04:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f101a9464bfbda42730b54a66f926d75ed2cd31e'/>
<id>f101a9464bfbda42730b54a66f926d75ed2cd31e</id>
<content type='text'>
struct memcg_cache_params has a union.  Different parts of this union
are used for root and non-root caches.  A part with destroying work is
used only for non-root caches.

  BUG: unable to handle kernel paging request at 0000000fffffffe0
  IP: kmem_cache_alloc+0x41/0x1f0
  Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
  CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G      D      3.10.0-rc1+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  RIP: kmem_cache_alloc+0x41/0x1f0
  Call Trace:
   getname_flags.part.34+0x30/0x140
   getname+0x38/0x60
   do_sys_open+0xc5/0x1e0
   SyS_open+0x22/0x30
   system_call_fastpath+0x16/0x1b
  Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 &lt;4d&gt; 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
  RIP  [&lt;ffffffff8116b641&gt;] kmem_cache_alloc+0x41/0x1f0

Signed-off-by: Andrey Vagin &lt;avagin@openvz.org&gt;
Cc: Konstantin Khlebnikov &lt;khlebnikov@openvz.org&gt;
Cc: Glauber Costa &lt;glommer@parallels.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Balbir Singh &lt;bsingharora@gmail.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Li Zefan &lt;lizefan@huawei.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[3.9.x]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
struct memcg_cache_params has a union.  Different parts of this union
are used for root and non-root caches.  A part with destroying work is
used only for non-root caches.

  BUG: unable to handle kernel paging request at 0000000fffffffe0
  IP: kmem_cache_alloc+0x41/0x1f0
  Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
  CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G      D      3.10.0-rc1+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  RIP: kmem_cache_alloc+0x41/0x1f0
  Call Trace:
   getname_flags.part.34+0x30/0x140
   getname+0x38/0x60
   do_sys_open+0xc5/0x1e0
   SyS_open+0x22/0x30
   system_call_fastpath+0x16/0x1b
  Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 &lt;4d&gt; 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
  RIP  [&lt;ffffffff8116b641&gt;] kmem_cache_alloc+0x41/0x1f0

Signed-off-by: Andrey Vagin &lt;avagin@openvz.org&gt;
Cc: Konstantin Khlebnikov &lt;khlebnikov@openvz.org&gt;
Cc: Glauber Costa &lt;glommer@parallels.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Balbir Singh &lt;bsingharora@gmail.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Li Zefan &lt;lizefan@huawei.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[3.9.x]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>arch, mm: Remove tlb_fast_mode()</title>
<updated>2013-06-06T01:07:26+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2013-06-05T10:26:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=29eb77825cc7da8d45b642de2de3d423dc8a363f'/>
<id>29eb77825cc7da8d45b642de2de3d423dc8a363f</id>
<content type='text'>
Since the introduction of preemptible mmu_gather TLB fast mode has been
broken. TLB fast mode relies on there being absolutely no concurrency;
it frees pages first and invalidates TLBs later.

However now we can get concurrency and stuff goes *bang*.

This patch removes all tlb_fast_mode() code; it was found the better
option vs trying to patch the hole by entangling tlb invalidation with
the scheduler.

Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Russell King &lt;linux@arm.linux.org.uk&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Reported-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since the introduction of preemptible mmu_gather TLB fast mode has been
broken. TLB fast mode relies on there being absolutely no concurrency;
it frees pages first and invalidates TLBs later.

However now we can get concurrency and stuff goes *bang*.

This patch removes all tlb_fast_mode() code; it was found the better
option vs trying to patch the hole by entangling tlb invalidation with
the scheduler.

Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Russell King &lt;linux@arm.linux.org.uk&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Reported-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas</title>
<updated>2013-05-24T23:22:53+00:00</updated>
<author>
<name>Cliff Wickman</name>
<email>cpw@sgi.com</email>
</author>
<published>2013-05-24T22:55:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a9ff785e4437c83d2179161e012f5bdfbd6381f0'/>
<id>a9ff785e4437c83d2179161e012f5bdfbd6381f0</id>
<content type='text'>
A panic can be caused by simply cat'ing /proc/&lt;pid&gt;/smaps while an
application has a VM_PFNMAP range.  It happened in-house when a
benchmarker was trying to decipher the memory layout of his program.

/proc/&lt;pid&gt;/smaps and similar walks through a user page table should not
be looking at VM_PFNMAP areas.

Certain tests in walk_page_range() (specifically split_huge_page_pmd())
assume that all the mapped PFN's are backed with page structures.  And
this is not usually true for VM_PFNMAP areas.  This can result in panics
on kernel page faults when attempting to address those page structures.

There are a half dozen callers of walk_page_range() that walk through a
task's entire page table (as N.  Horiguchi pointed out).  So rather than
change all of them, this patch changes just walk_page_range() to ignore
VM_PFNMAP areas.

The logic of hugetlb_vma() is moved back into walk_page_range(), as we
want to test any vma in the range.

VM_PFNMAP areas are used by:
- graphics memory manager   gpu/drm/drm_gem.c
- global reference unit     sgi-gru/grufile.c
- sgi special memory        char/mspec.c
- and probably several out-of-tree modules

[akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
Signed-off-by: Cliff Wickman &lt;cpw@sgi.com&gt;
Reviewed-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: David Sterba &lt;dsterba@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@gmail.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A panic can be caused by simply cat'ing /proc/&lt;pid&gt;/smaps while an
application has a VM_PFNMAP range.  It happened in-house when a
benchmarker was trying to decipher the memory layout of his program.

/proc/&lt;pid&gt;/smaps and similar walks through a user page table should not
be looking at VM_PFNMAP areas.

Certain tests in walk_page_range() (specifically split_huge_page_pmd())
assume that all the mapped PFN's are backed with page structures.  And
this is not usually true for VM_PFNMAP areas.  This can result in panics
on kernel page faults when attempting to address those page structures.

There are a half dozen callers of walk_page_range() that walk through a
task's entire page table (as N.  Horiguchi pointed out).  So rather than
change all of them, this patch changes just walk_page_range() to ignore
VM_PFNMAP areas.

The logic of hugetlb_vma() is moved back into walk_page_range(), as we
want to test any vma in the range.

VM_PFNMAP areas are used by:
- graphics memory manager   gpu/drm/drm_gem.c
- global reference unit     sgi-gru/grufile.c
- sgi special memory        char/mspec.c
- and probably several out-of-tree modules

[akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
Signed-off-by: Cliff Wickman &lt;cpw@sgi.com&gt;
Reviewed-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: David Sterba &lt;dsterba@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@gmail.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
