<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/mm, branch v3.6-rc4</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>mm, slab: lock the correct nodelist after reenabling irqs</title>
<updated>2012-08-29T18:32:21+00:00</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2012-08-29T02:57:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=51cd8e6ff265650e35e46b5bcbe2ee381a7a2877'/>
<id>51cd8e6ff265650e35e46b5bcbe2ee381a7a2877</id>
<content type='text'>
cache_grow() can reenable irqs so the cpu (and node) can change, so ensure
that we take list_lock on the correct nodelist.

This fixes an issue with commit 072bb0aa5e06 ("mm: sl[au]b: add
knowledge of PFMEMALLOC reserve pages") where list_lock for the wrong
node was taken after growing the cache.

Reported-and-tested-by: Haggai Eran &lt;haggaie@mellanox.com&gt;
Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
cache_grow() can reenable irqs so the cpu (and node) can change, so ensure
that we take list_lock on the correct nodelist.

This fixes an issue with commit 072bb0aa5e06 ("mm: sl[au]b: add
knowledge of PFMEMALLOC reserve pages") where list_lock for the wrong
node was taken after growing the cache.

Reported-and-tested-by: Haggai Eran &lt;haggaie@mellanox.com&gt;
Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.dk/linux-block</title>
<updated>2012-08-25T18:36:43+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2012-08-25T18:36:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a7e546f175f07630453c44b5afe14dd667dcfec9'/>
<id>a7e546f175f07630453c44b5afe14dd667dcfec9</id>
<content type='text'>
Pull block-related fixes from Jens Axboe:

 - Improvements to the buffered and direct write IO plugging from
   Fengguang.

 - Abstract out the mapping of a bio in a request, and use that to
   provide a blk_bio_map_sg() helper.  Useful for mapping just a bio
   instead of a full request.

 - Regression fix from Hugh, fixing up a patch that went into the
   previous release cycle (and marked stable, too) attempting to prevent
   a loop in __getblk_slow().

 - Updates to discard requests, fixing up the sizing and how we align
   them.  Also a change to disallow merging of discard requests, since
   that doesn't really work properly yet.

 - A few drbd fixes.

 - Documentation updates.

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: replace __getblk_slow misfix by grow_dev_page fix
  drbd: Write all pages of the bitmap after an online resize
  drbd: Finish requests that completed while IO was frozen
  drbd: fix drbd wire compatibility for empty flushes
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update missing index files in block/00-INDEX
  block: move down direct IO plugging
  block: remove plugging at buffered write time
  block: disable discard request merge temporarily
  bio: Fix potential memory leak in bio_find_or_create_slab()
  block: Don't use static to define "void *p" in show_partition_start()
  block: Add blk_bio_map_sg() helper
  block: Introduce __blk_segment_map_sg() helper
  fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
  block: split discard into aligned requests
  block: reorganize rounding of max_discard_sectors
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull block-related fixes from Jens Axboe:

 - Improvements to the buffered and direct write IO plugging from
   Fengguang.

 - Abstract out the mapping of a bio in a request, and use that to
   provide a blk_bio_map_sg() helper.  Useful for mapping just a bio
   instead of a full request.

 - Regression fix from Hugh, fixing up a patch that went into the
   previous release cycle (and marked stable, too) attempting to prevent
   a loop in __getblk_slow().

 - Updates to discard requests, fixing up the sizing and how we align
   them.  Also a change to disallow merging of discard requests, since
   that doesn't really work properly yet.

 - A few drbd fixes.

 - Documentation updates.

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: replace __getblk_slow misfix by grow_dev_page fix
  drbd: Write all pages of the bitmap after an online resize
  drbd: Finish requests that completed while IO was frozen
  drbd: fix drbd wire compatibility for empty flushes
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update missing index files in block/00-INDEX
  block: move down direct IO plugging
  block: remove plugging at buffered write time
  block: disable discard request merge temporarily
  bio: Fix potential memory leak in bio_find_or_create_slab()
  block: Don't use static to define "void *p" in show_partition_start()
  block: Add blk_bio_map_sg() helper
  block: Introduce __blk_segment_map_sg() helper
  fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
  block: split discard into aligned requests
  block: reorganize rounding of max_discard_sectors
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2012-08-24T04:48:41+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2012-08-24T04:48:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7ca63ee1b005623af7c4f3481a2976de3b9fed53'/>
<id>7ca63ee1b005623af7c4f3481a2976de3b9fed53</id>
<content type='text'>
Pull perf fixes from Ingo Molnar:
 "This tree contains misc fixlets: a perf script python binding fix, a
  uprobes fix and a syscall tracing fix."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Add missing files to build the python binding
  uprobes: Fix mmap_region()'s mm-&gt;mm_rb corruption if uprobe_mmap() fails
  tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull perf fixes from Ingo Molnar:
 "This tree contains misc fixlets: a perf script python binding fix, a
  uprobes fix and a syscall tracing fix."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Add missing files to build the python binding
  uprobes: Fix mmap_region()'s mm-&gt;mm_rb corruption if uprobe_mmap() fails
  tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: Abort async compaction if locks are contended or taking too long</title>
<updated>2012-08-21T23:45:03+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-08-21T23:16:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c67fe3752abe6ab47639e2f9b836900c3dc3da84'/>
<id>c67fe3752abe6ab47639e2f9b836900c3dc3da84</id>
<content type='text'>
Jim Schutt reported a problem that pointed at compaction contending
heavily on locks.  The workload is straight-forward and in his own words;

	The systems in question have 24 SAS drives spread across 3 HBAs,
	running 24 Ceph OSD instances, one per drive.  FWIW these servers
	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
	Ceph Linux clients doing dd simultaneously to a Ceph file system
	backed by 12 of these servers.

Early in the test everything looks fine

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
  27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
  28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
   6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
  22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0

and then it goes to pot

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
  207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
  123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
  123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
  622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
  223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0

Note that system CPU usage is very high blocks being written out has
dropped by 42%. He analysed this with perf and found

  perf record -g -a sleep 10
  perf report --sort symbol --call-graph fractal,5
    34.63%  [k] _raw_spin_lock_irqsave
            |
            |--97.30%-- isolate_freepages
            |          compaction_alloc
            |          unmap_and_move
            |          migrate_pages
            |          compact_zone
            |          compact_zone_order
            |          try_to_compact_pages
            |          __alloc_pages_direct_compact
            |          __alloc_pages_slowpath
            |          __alloc_pages_nodemask
            |          alloc_pages_vma
            |          do_huge_pmd_anonymous_page
            |          handle_mm_fault
            |          do_page_fault
            |          page_fault
            |          |
            |          |--87.39%-- skb_copy_datagram_iovec
            |          |          tcp_recvmsg
            |          |          inet_recvmsg
            |          |          sock_recvmsg
            |          |          sys_recvfrom
            |          |          system_call
            |          |          __recv
            |          |          |
            |          |           --100.00%-- (nil)
            |          |
            |           --12.61%-- memcpy
             --2.70%-- [...]

There was other data but primarily it is all showing that compaction is
contended heavily on the zone-&gt;lock and zone-&gt;lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.

Reported-by: Jim Schutt &lt;jaschut@sandia.gov&gt;
Tested-by: Jim Schutt &lt;jaschut@sandia.gov&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Jim Schutt reported a problem that pointed at compaction contending
heavily on locks.  The workload is straight-forward and in his own words;

	The systems in question have 24 SAS drives spread across 3 HBAs,
	running 24 Ceph OSD instances, one per drive.  FWIW these servers
	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
	Ceph Linux clients doing dd simultaneously to a Ceph file system
	backed by 12 of these servers.

Early in the test everything looks fine

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
  27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
  28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
   6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
  22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0

and then it goes to pot

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
  207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
  123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
  123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
  622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
  223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0

Note that system CPU usage is very high blocks being written out has
dropped by 42%. He analysed this with perf and found

  perf record -g -a sleep 10
  perf report --sort symbol --call-graph fractal,5
    34.63%  [k] _raw_spin_lock_irqsave
            |
            |--97.30%-- isolate_freepages
            |          compaction_alloc
            |          unmap_and_move
            |          migrate_pages
            |          compact_zone
            |          compact_zone_order
            |          try_to_compact_pages
            |          __alloc_pages_direct_compact
            |          __alloc_pages_slowpath
            |          __alloc_pages_nodemask
            |          alloc_pages_vma
            |          do_huge_pmd_anonymous_page
            |          handle_mm_fault
            |          do_page_fault
            |          page_fault
            |          |
            |          |--87.39%-- skb_copy_datagram_iovec
            |          |          tcp_recvmsg
            |          |          inet_recvmsg
            |          |          sock_recvmsg
            |          |          sys_recvfrom
            |          |          system_call
            |          |          __recv
            |          |          |
            |          |           --100.00%-- (nil)
            |          |
            |           --12.61%-- memcpy
             --2.70%-- [...]

There was other data but primarily it is all showing that compaction is
contended heavily on the zone-&gt;lock and zone-&gt;lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.

Reported-by: Jim Schutt &lt;jaschut@sandia.gov&gt;
Tested-by: Jim Schutt &lt;jaschut@sandia.gov&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: have order &gt; 0 compaction start near a pageblock with free pages</title>
<updated>2012-08-21T23:45:03+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-08-21T23:16:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=de74f1cc3b1e9730d9b58580cd11361d30cd182d'/>
<id>de74f1cc3b1e9730d9b58580cd11361d30cd182d</id>
<content type='text'>
Commit 7db8889ab05b ("mm: have order &gt; 0 compaction start off where it
left") introduced a caching mechanism to reduce the amount work the free
page scanner does in compaction.  However, it has a problem.  Consider
two process simultaneously scanning free pages

					    			C
	Process A		M     S     			F
			|---------------------------------------|
	Process B		M 	FS

	C is zone-&gt;compact_cached_free_pfn
	S is cc-&gt;start_pfree_pfn
	M is cc-&gt;migrate_pfn
	F is cc-&gt;free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

                if (cc-&gt;order &gt; 0 &amp;&amp; (!cc-&gt;wrapped ||
                                      zone-&gt;compact_cached_free_pfn &gt;
                                      cc-&gt;start_free_pfn))
                        pfn = min(pfn, zone-&gt;compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner
skips almost the entire space it should have scanned.  When there are
multiple processes compacting it can end in a situation where the entire
zone is not being scanned at all.  Further, it is possible for two
processes to ping-pong update to compact_cached_free_pfn which is just
random.

Overall, the end result wrecks allocation success rates.

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic because it's not clear that it
matters if two free scanners happen to be in the same block but with
racing updates it's too easy for it to skip over blocks it should not.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
	of the zone. When a wrapped scanner isolates a page, it updates
	compact_cached_free_pfn to point to the highest pageblock it
	can isolate pages from.

If a scanner has not wrapped when it has finished isolated pages it
	checks if compact_cached_free_pfn is pointing to the end of the
	zone. If so, the value is updated to point to the highest
	pageblock that pages were isolated from. This value will not
	be updated again until a free page scanner wraps and resets
	compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 7db8889ab05b ("mm: have order &gt; 0 compaction start off where it
left") introduced a caching mechanism to reduce the amount work the free
page scanner does in compaction.  However, it has a problem.  Consider
two process simultaneously scanning free pages

					    			C
	Process A		M     S     			F
			|---------------------------------------|
	Process B		M 	FS

	C is zone-&gt;compact_cached_free_pfn
	S is cc-&gt;start_pfree_pfn
	M is cc-&gt;migrate_pfn
	F is cc-&gt;free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

                if (cc-&gt;order &gt; 0 &amp;&amp; (!cc-&gt;wrapped ||
                                      zone-&gt;compact_cached_free_pfn &gt;
                                      cc-&gt;start_free_pfn))
                        pfn = min(pfn, zone-&gt;compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner
skips almost the entire space it should have scanned.  When there are
multiple processes compacting it can end in a situation where the entire
zone is not being scanned at all.  Further, it is possible for two
processes to ping-pong update to compact_cached_free_pfn which is just
random.

Overall, the end result wrecks allocation success rates.

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic because it's not clear that it
matters if two free scanners happen to be in the same block but with
racing updates it's too easy for it to skip over blocks it should not.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
	of the zone. When a wrapped scanner isolates a page, it updates
	compact_cached_free_pfn to point to the highest pageblock it
	can isolate pages from.

If a scanner has not wrapped when it has finished isolated pages it
	checks if compact_cached_free_pfn is pointing to the end of the
	zone. If so, the value is updated to point to the highest
	pageblock that pages were isolated from. This value will not
	be updated again until a free page scanner wraps and resets
	compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: correct page-&gt;pfmemalloc to fix deactivate_slab regression</title>
<updated>2012-08-21T23:45:03+00:00</updated>
<author>
<name>Alex Shi</name>
<email>alex.shi@intel.com</email>
</author>
<published>2012-08-21T23:16:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b121186ab1b12e2a96a945d88eae0735b4542158'/>
<id>b121186ab1b12e2a96a945d88eae0735b4542158</id>
<content type='text'>
Commit cfd19c5a9ecf ("mm: only set page-&gt;pfmemalloc when
ALLOC_NO_WATERMARKS was used") tried to narrow down page-&gt;pfmemalloc
setting, but it missed some places the pfmemalloc should be set.

So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
cause incorrect deactivate_slab() on our core2 server:

    64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |---0.34%-- deactivate_slab
                        |          __slab_alloc
                        |          kmem_cache_alloc
                        |          |

That causes our fio sync write performance to have a 40% regression.

Move the checking in get_page_from_freelist() which resolves this issue.

Signed-off-by: Alex Shi &lt;alex.shi@intel.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Tested-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Tested-by: Sage Weil &lt;sage@inktank.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit cfd19c5a9ecf ("mm: only set page-&gt;pfmemalloc when
ALLOC_NO_WATERMARKS was used") tried to narrow down page-&gt;pfmemalloc
setting, but it missed some places the pfmemalloc should be set.

So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
cause incorrect deactivate_slab() on our core2 server:

    64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |---0.34%-- deactivate_slab
                        |          __slab_alloc
                        |          kmem_cache_alloc
                        |          |

That causes our fio sync write performance to have a 40% regression.

Move the checking in get_page_from_freelist() which resolves this issue.

Signed-off-by: Alex Shi &lt;alex.shi@intel.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Tested-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Tested-by: Sage Weil &lt;sage@inktank.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/compaction.c: fix deferring compaction mistake</title>
<updated>2012-08-21T23:45:03+00:00</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2012-08-21T23:16:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c81758fbe0fdbbc0c74b37798f55bd9c91d5c068'/>
<id>c81758fbe0fdbbc0c74b37798f55bd9c91d5c068</id>
<content type='text'>
Commit aff622495c9a ("vmscan: only defer compaction for failed order and
higher") fixed bad deferring policy but made mistake about checking
compact_order_failed in __compact_pgdat().  So it can't update
compact_order_failed with the new order.  This ends up preventing
correct operation of policy deferral.  This patch fixes it.

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit aff622495c9a ("vmscan: only defer compaction for failed order and
higher") fixed bad deferring policy but made mistake about checking
compact_order_failed in __compact_pgdat().  So it can't update
compact_order_failed with the new order.  This ends up preventing
correct operation of policy deferral.  This patch fixes it.

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: change nr_ptes BUG_ON to WARN_ON</title>
<updated>2012-08-21T23:45:02+00:00</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2012-08-21T23:15:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f9aed62a2b12a8e04077737c9942111e14ed738e'/>
<id>f9aed62a2b12a8e04077737c9942111e14ed738e</id>
<content type='text'>
Occasionally an isolated BUG_ON(mm-&gt;nr_ptes) gets reported, indicating
that not all the page tables allocated could be found and freed when
exit_mmap() tore down the user address space.

There's usually nothing we can say about it, beyond that it's probably a
sign of some bad memory or memory corruption; though it might still
indicate a bug in vma or page table management (and did recently reveal a
race in THP, fixed a few months ago).

But one overdue change we can make is from BUG_ON to WARN_ON.

It's fairly likely that the system will crash shortly afterwards in some
other way (for example, the BUG_ON(page_mapped(page)) in
__delete_from_page_cache(), once an inode mapped into the lost page tables
gets evicted); but might tell us more before that.

Change the BUG_ON(page_mapped) to WARN_ON too?  Later perhaps: I'm less
eager, since that one has several times led to fixes.

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Occasionally an isolated BUG_ON(mm-&gt;nr_ptes) gets reported, indicating
that not all the page tables allocated could be found and freed when
exit_mmap() tore down the user address space.

There's usually nothing we can say about it, beyond that it's probably a
sign of some bad memory or memory corruption; though it might still
indicate a bug in vma or page table management (and did recently reveal a
race in THP, fixed a few months ago).

But one overdue change we can make is from BUG_ON to WARN_ON.

It's fairly likely that the system will crash shortly afterwards in some
other way (for example, the BUG_ON(page_mapped(page)) in
__delete_from_page_cache(), once an inode mapped into the lost page tables
gets evicted); but might tell us more before that.

Change the BUG_ON(page_mapped) to WARN_ON too?  Later perhaps: I'm less
eager, since that one has several times led to fixes.

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>uprobes: Fix mmap_region()'s mm-&gt;mm_rb corruption if uprobe_mmap() fails</title>
<updated>2012-08-21T09:48:12+00:00</updated>
<author>
<name>Oleg Nesterov</name>
<email>oleg@redhat.com</email>
</author>
<published>2012-08-19T17:10:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c7a3a88c938fbe3d70c2278e082b80eb830d1c58'/>
<id>c7a3a88c938fbe3d70c2278e082b80eb830d1c58</id>
<content type='text'>
This patch fixes:

  https://bugzilla.redhat.com/show_bug.cgi?id=843640

If mmap_region()-&gt;uprobe_mmap() fails, unmap_and_free_vma path
does unmap_region() but does not remove the soon-to-be-freed vma
from rb tree. Actually there are more problems but this is how
William noticed this bug.

Perhaps we could do do_munmap() + return in this case, but in
fact it is simply wrong to abort if uprobe_mmap() fails. Until
at least we move the !UPROBE_COPY_INSN code from
install_breakpoint() to uprobe_register().

For example, uprobe_mmap()-&gt;install_breakpoint() can fail if the
probed insn is not supported (remember, uprobe_register()
succeeds if nobody mmaps inode/offset), mmap() should not fail
in this case.

dup_mmap()-&gt;uprobe_mmap() is wrong too by the same reason,
fork() can race with uprobe_register() and fail for no reason if
it wins the race and does install_breakpoint() first.

And, if nothing else, both mmap_region() and dup_mmap() return
success if uprobe_mmap() fails. Change them to ignore the error
code from uprobe_mmap().

Reported-and-tested-by: William Cohen &lt;wcohen@redhat.com&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Cc: &lt;stable@vger.kernel.org&gt; # v3.5
Cc: Anton Arapov &lt;anton@redhat.com&gt;
Cc: William Cohen &lt;wcohen@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch fixes:

  https://bugzilla.redhat.com/show_bug.cgi?id=843640

If mmap_region()-&gt;uprobe_mmap() fails, unmap_and_free_vma path
does unmap_region() but does not remove the soon-to-be-freed vma
from rb tree. Actually there are more problems but this is how
William noticed this bug.

Perhaps we could do do_munmap() + return in this case, but in
fact it is simply wrong to abort if uprobe_mmap() fails. Until
at least we move the !UPROBE_COPY_INSN code from
install_breakpoint() to uprobe_register().

For example, uprobe_mmap()-&gt;install_breakpoint() can fail if the
probed insn is not supported (remember, uprobe_register()
succeeds if nobody mmaps inode/offset), mmap() should not fail
in this case.

dup_mmap()-&gt;uprobe_mmap() is wrong too by the same reason,
fork() can race with uprobe_register() and fail for no reason if
it wins the race and does install_breakpoint() first.

And, if nothing else, both mmap_region() and dup_mmap() return
success if uprobe_mmap() fails. Change them to ignore the error
code from uprobe_mmap().

Reported-and-tested-by: William Cohen &lt;wcohen@redhat.com&gt;
Signed-off-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Acked-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Cc: &lt;stable@vger.kernel.org&gt; # v3.5
Cc: Anton Arapov &lt;anton@redhat.com&gt;
Cc: William Cohen &lt;wcohen@redhat.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: move down direct IO plugging</title>
<updated>2012-08-09T13:23:09+00:00</updated>
<author>
<name>Fengguang Wu</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2012-08-09T13:23:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=647d1e4c5235763b83fbfe74a09d148edc6ca152'/>
<id>647d1e4c5235763b83fbfe74a09d148edc6ca152</id>
<content type='text'>
Move unplugging for direct I/O from around -&gt;direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.

CC: Li Shaohua &lt;shli@fusionio.com&gt;
Acked-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Move unplugging for direct I/O from around -&gt;direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.

CC: Li Shaohua &lt;shli@fusionio.com&gt;
Acked-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
</feed>
