<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/mm/compaction.c, branch v3.7.7</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>mm: compaction: partially revert capture of suitable high-order page</title>
<updated>2013-01-17T16:46:10+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2013-01-11T22:32:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e5e65a5233c3f368c442413cebe7de2a4580b564'/>
<id>e5e65a5233c3f368c442413cebe7de2a4580b564</id>
<content type='text'>
commit 8fb74b9fb2b182d54beee592350d9ea1f325917a upstream.

Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
waiting for POLLIN on a local TCP socket.  It was easier to trigger if
there was disk IO and dirty pages at the same time and he bisected it to
commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
immediately when it is made available").

The intention of that patch was to improve high-order allocations under
memory pressure after changes made to reclaim in 3.6 drastically hurt
THP allocations but the approach was flawed.  For Eric, the problem was
that page-&gt;pfmemalloc was not being cleared for captured pages leading
to a poor interaction with swap-over-NFS support causing the packets to
be dropped.  However, I identified a few more problems with the patch
including the fact that it can increase contention on zone-&gt;lock in some
cases which could result in async direct compaction being aborted early.

In retrospect the capture patch took the wrong approach.  What it should
have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
was allocating for THP and avoided races that way.  While the patch was
showing to improve allocation success rates at the time, the benefit is
marginal given the relative complexity and it should be revisited from
scratch in the context of the other reclaim-related changes that have
taken place since the patch was first written and tested.  This patch
partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a
suitable high-order page immediately when it is made available").

Reported-and-tested-by: Eric Wong &lt;normalperson@yhbt.net&gt;
Tested-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 8fb74b9fb2b182d54beee592350d9ea1f325917a upstream.

Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
waiting for POLLIN on a local TCP socket.  It was easier to trigger if
there was disk IO and dirty pages at the same time and he bisected it to
commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
immediately when it is made available").

The intention of that patch was to improve high-order allocations under
memory pressure after changes made to reclaim in 3.6 drastically hurt
THP allocations but the approach was flawed.  For Eric, the problem was
that page-&gt;pfmemalloc was not being cleared for captured pages leading
to a poor interaction with swap-over-NFS support causing the packets to
be dropped.  However, I identified a few more problems with the patch
including the fact that it can increase contention on zone-&gt;lock in some
cases which could result in async direct compaction being aborted early.

In retrospect the capture patch took the wrong approach.  What it should
have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
was allocating for THP and avoided races that way.  While the patch was
showing to improve allocation success rates at the time, the benefit is
marginal given the relative complexity and it should be revisited from
scratch in the context of the other reclaim-related changes that have
taken place since the patch was first written and tested.  This patch
partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a
suitable high-order page immediately when it is made available").

Reported-and-tested-by: Eric Wong &lt;normalperson@yhbt.net&gt;
Tested-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: fix echo 1 &gt; compact_memory return error issue</title>
<updated>2013-01-17T16:45:54+00:00</updated>
<author>
<name>Jason Liu</name>
<email>r64343@freescale.com</email>
</author>
<published>2013-01-11T22:31:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=b743b4f99918fee303eead2b877d1a6daa306a40'/>
<id>b743b4f99918fee303eead2b877d1a6daa306a40</id>
<content type='text'>
commit 7964c06d66c76507d8b6b662bffea770c29ef0ce upstream.

when run the folloing command under shell, it will return error

  sh/$ echo 1 &gt; /proc/sys/vm/compact_memory
  sh/$ sh: write error: Bad address

After strace, I found the following log:

  ...
  write(1, "1\n", 2)               = 3
  write(1, "", 4294967295)         = -1 EFAULT (Bad address)
  write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
  ) = 31

This tells system return 3(COMPACT_COMPLETE) after write data to
compact_memory.

The fix is to make the system just return 0 instead 3(COMPACT_COMPLETE)
from sysctl_compaction_handler after compaction_nodes finished.

Signed-off-by: Jason Liu &lt;r64343@freescale.com&gt;
Suggested-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 7964c06d66c76507d8b6b662bffea770c29ef0ce upstream.

when run the folloing command under shell, it will return error

  sh/$ echo 1 &gt; /proc/sys/vm/compact_memory
  sh/$ sh: write error: Bad address

After strace, I found the following log:

  ...
  write(1, "1\n", 2)               = 3
  write(1, "", 4294967295)         = -1 EFAULT (Bad address)
  write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
  ) = 31

This tells system return 3(COMPACT_COMPLETE) after write data to
compact_memory.

The fix is to make the system just return 0 instead 3(COMPACT_COMPLETE)
from sysctl_compaction_handler after compaction_nodes finished.

Signed-off-by: Jason Liu &lt;r64343@freescale.com&gt;
Suggested-by: David Rientjes &lt;rientjes@google.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: validate pfn range passed to isolate_freepages_block</title>
<updated>2012-12-06T19:17:33+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-12-06T19:01:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=60177d31d215bc2b4c5a7aa6f742800e04fa0a92'/>
<id>60177d31d215bc2b4c5a7aa6f742800e04fa0a92</id>
<content type='text'>
Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a
new MAX_ORDER_NR_PAGES block during isolation for migration") added a
check for pfn_valid() when isolating pages for migration as the scanner
does not necessarily start pageblock-aligned.

Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near
where it left off"), the free scanner has the same problem.  This patch
makes sure that the pfn range passed to isolate_freepages_block() is
within the same block so that pfn_valid() checks are unnecessary.

In answer to Henrik's wondering why others have not reported this:
reproducing this requires a large enough hole with the right aligment to
have compaction walk into a PFN range with no memmap.  Size and
alignment depends in the memory model - 4M for FLATMEM and 128M for
SPARSEMEM on x86.  It needs a "lucky" machine.

Reported-by: Henrik Rydberg &lt;rydberg@euromail.se&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 0bf380bc70ec ("mm: compaction: check pfn_valid when entering a
new MAX_ORDER_NR_PAGES block during isolation for migration") added a
check for pfn_valid() when isolating pages for migration as the scanner
does not necessarily start pageblock-aligned.

Since commit c89511ab2f8f ("mm: compaction: Restart compaction from near
where it left off"), the free scanner has the same problem.  This patch
makes sure that the pfn range passed to isolate_freepages_block() is
within the same block so that pfn_valid() checks are unnecessary.

In answer to Henrik's wondering why others have not reported this:
reproducing this requires a large enough hole with the right aligment to
have compaction walk into a PFN range with no memmap.  Size and
alignment depends in the memory model - 4M for FLATMEM and 128M for
SPARSEMEM on x86.  It needs a "lucky" machine.

Reported-by: Henrik Rydberg &lt;rydberg@euromail.se&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: correct the nr_strict va isolated check for CMA</title>
<updated>2012-10-19T21:07:47+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-10-19T20:56:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=0db63d7e25f96e2c6da925c002badf6f144ddf30'/>
<id>0db63d7e25f96e2c6da925c002badf6f144ddf30</id>
<content type='text'>
Thierry reported that the "iron out" patch for isolate_freepages_block()
had problems due to the strict check being too strict with "mm:
compaction: Iron out isolate_freepages_block() and
isolate_freepages_range() -fix1".  It's possible that more pages than
necessary are isolated but the check still fails and I missed that this
fix was not picked up before RC1.  This same problem has been identified
in 3.7-RC1 by Tony Prisk and should be addressed by the following patch.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Tested-by: Tony Prisk &lt;linux@prisktech.co.nz&gt;
Reported-by: Thierry Reding &lt;thierry.reding@avionic-design.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Thierry reported that the "iron out" patch for isolate_freepages_block()
had problems due to the strict check being too strict with "mm:
compaction: Iron out isolate_freepages_block() and
isolate_freepages_range() -fix1".  It's possible that more pages than
necessary are isolated but the check still fails and I missed that this
fix was not picked up before RC1.  This same problem has been identified
in 3.7-RC1 by Tony Prisk and should be addressed by the following patch.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Tested-by: Tony Prisk &lt;linux@prisktech.co.nz&gt;
Reported-by: Thierry Reding &lt;thierry.reding@avionic-design.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>CMA: migrate mlocked pages</title>
<updated>2012-10-09T07:23:00+00:00</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2012-10-08T23:33:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e46a28790e594c0876d1a84270926abf75460f61'/>
<id>e46a28790e594c0876d1a84270926abf75460f61</id>
<content type='text'>
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.

This patch makes mlocked pages be migrated out.  Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running.  If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.

[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Cc: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.

This patch makes mlocked pages be migrated out.  Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running.  If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.

[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Cc: Marek Szyprowski &lt;m.szyprowski@samsung.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity</title>
<updated>2012-10-09T07:22:51+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-10-08T23:32:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=62997027ca5b3d4618198ed8b1aba40b61b1137b'/>
<id>62997027ca5b3d4618198ed8b1aba40b61b1137b</id>
<content type='text'>
Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning.  This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths.  Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever.  Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and is basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic.  There are two cases where the
cache is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
   scanner meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
   finished being deferred then a process will clear the cache and start a
   full scan.  This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless.  For allocations that can fail such as THP,
they will simply fail.  For requests that cannot fail, they will retry the
allocation.  Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Cc: Rafael Aquini &lt;aquini@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning.  This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths.  Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever.  Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and is basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic.  There are two cases where the
cache is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
   scanner meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
   finished being deferred then a process will clear the cache and start a
   full scan.  This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless.  For allocations that can fail such as THP,
they will simply fail.  For requests that cannot fail, they will retry the
allocation.  Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Cc: Rafael Aquini &lt;aquini@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: Restart compaction from near where it left off</title>
<updated>2012-10-09T07:22:50+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-10-08T23:32:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c89511ab2f8fe2b47585e60da8af7fd213ec877e'/>
<id>c89511ab2f8fe2b47585e60da8af7fd213ec877e</id>
<content type='text'>
This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order &gt; 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
 This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from
on subsequent compaction invocations using the pageblock-skip information.
 When compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order &gt; 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
 This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from
on subsequent compaction invocations using the pageblock-skip information.
 When compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: cache if a pageblock was scanned and no pages were isolated</title>
<updated>2012-10-09T07:22:50+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-10-08T23:32:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=bb13ffeb9f6bfeb301443994dfbf29f91117dfb3'/>
<id>bb13ffeb9f6bfeb301443994dfbf29f91117dfb3</id>
<content type='text'>
When compaction was implemented it was known that scanning could
potentially be excessive.  The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line.  It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future.  When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive.  If the information is too stale then allocations will fail
that might have otherwise succeeded.  In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event.  Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure.  Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try allocate the page.  The time-based mechanism is
clumsy but a better option is not obvious.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Cc: Kyungmin Park &lt;kyungmin.park@samsung.com&gt;
Cc: Mark Brown &lt;broonie@opensource.wolfsonmicro.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When compaction was implemented it was known that scanning could
potentially be excessive.  The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line.  It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future.  When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive.  If the information is too stale then allocations will fail
that might have otherwise succeeded.  In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event.  Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure.  Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try allocate the page.  The time-based mechanism is
clumsy but a better option is not obvious.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Cc: Michal Nazarewicz &lt;mina86@mina86.com&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Cc: Kyungmin Park &lt;kyungmin.park@samsung.com&gt;
Cc: Mark Brown &lt;broonie@opensource.wolfsonmicro.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>revert "mm: have order &gt; 0 compaction start off where it left"</title>
<updated>2012-10-09T07:22:50+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-10-08T23:32:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=753341a4b85ff337487b9959c71c529f522004f4'/>
<id>753341a4b85ff337487b9959c71c529f522004f4</id>
<content type='text'>
This reverts commit 7db8889ab05b ("mm: have order &gt; 0 compaction start
off where it left") and commit de74f1cc ("mm: have order &gt; 0 compaction
start near a pageblock with free pages").  These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand.  A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This reverts commit 7db8889ab05b ("mm: have order &gt; 0 compaction start
off where it left") and commit de74f1cc ("mm: have order &gt; 0 compaction
start near a pageblock with free pages").  These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand.  A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: acquire the zone-&gt;lock as late as possible</title>
<updated>2012-10-09T07:22:49+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-10-08T23:32:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f40d1e42bb988d2a26e8e111ea4c4c7bac819b7e'/>
<id>f40d1e42bb988d2a26e8e111ea4c4c7bac819b7e</id>
<content type='text'>
Compaction's free scanner acquires the zone-&gt;lock when checking for
PageBuddy pages and isolating them.  It does this even if there are no
PageBuddy pages in the range.

This patch defers acquiring the zone lock for as long as possible.  In the
event there are no free pages in the pageblock then the lock will not be
acquired at all which reduces contention on zone-&gt;lock.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Tested-by: Peter Ujfalusi &lt;peter.ujfalusi@ti.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Compaction's free scanner acquires the zone-&gt;lock when checking for
PageBuddy pages and isolating them.  It does this even if there are no
PageBuddy pages in the range.

This patch defers acquiring the zone lock for as long as possible.  In the
event there are no free pages in the pageblock then the lock will not be
acquired at all which reduces contention on zone-&gt;lock.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Richard Davies &lt;richard@arachsys.com&gt;
Cc: Shaohua Li &lt;shli@kernel.org&gt;
Cc: Avi Kivity &lt;avi@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Acked-by: Minchan Kim &lt;minchan@kernel.org&gt;
Tested-by: Peter Ujfalusi &lt;peter.ujfalusi@ti.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
