<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/mm, branch v3.0-rc4</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>mm: avoid anon_vma_chain allocation under anon_vma lock</title>
<updated>2011-06-18T02:24:11+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-06-18T02:05:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=dd34739c03f2f9a79403d33419c2e61e11b4c403'/>
<id>dd34739c03f2f9a79403d33419c2e61e11b4c403</id>
<content type='text'>
Hugh Dickins points out that lockdep (correctly) spots a potential
deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
of anon_vma_chain while doing anon_vma_clone().  The problem is that
page reclaim will want to take the anon_vma lock of any anonymous pages
that it will try to reclaim.

So re-organize the code in anon_vma_clone() slightly: first do just a
GFP_NOWAIT allocation, which will usually work fine.  But if that fails,
let's just drop the lock and re-do the allocation, now with GFP_KERNEL.

End result: not only do we avoid the locking problem, this also ends up
getting better concurrency in case the allocation does need to block.
Tim Chen reports that with all these anon_vma locking tweaks, we're now
almost back up to the spinlock performance.

Reported-and-tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Tested-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Hugh Dickins points out that lockdep (correctly) spots a potential
deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
of anon_vma_chain while doing anon_vma_clone().  The problem is that
page reclaim will want to take the anon_vma lock of any anonymous pages
that it will try to reclaim.

So re-organize the code in anon_vma_clone() slightly: first do just a
GFP_NOWAIT allocation, which will usually work fine.  But if that fails,
let's just drop the lock and re-do the allocation, now with GFP_KERNEL.

End result: not only do we avoid the locking problem, this also ends up
getting better concurrency in case the allocation does need to block.
Tim Chen reports that with all these anon_vma locking tweaks, we're now
almost back up to the spinlock performance.

Reported-and-tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Tested-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: avoid repeated anon_vma lock/unlock sequences in unlink_anon_vmas()</title>
<updated>2011-06-18T02:23:52+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2011-06-17T11:54:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=eee2acbae95555006307395d8a6c91452d62851d'/>
<id>eee2acbae95555006307395d8a6c91452d62851d</id>
<content type='text'>
This matches the anon_vma_clone() case, and uses the same lock helper
functions.  Because of the need to potentially release the anon_vma's,
it's a bit more complex, though.

We traverse the 'vma-&gt;anon_vma_chain' in two phases: the first loop gets
the anon_vma lock (with the helper function that only takes the lock
once for the whole loop), and removes any entries that don't need any
more processing.

The second phase just traverses the remaining list entries (without
holding the anon_vma lock), and does any actual freeing of the
anon_vma's that is required.

Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Tested-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This matches the anon_vma_clone() case, and uses the same lock helper
functions.  Because of the need to potentially release the anon_vma's,
it's a bit more complex, though.

We traverse the 'vma-&gt;anon_vma_chain' in two phases: the first loop gets
the anon_vma lock (with the helper function that only takes the lock
once for the whole loop), and removes any entries that don't need any
more processing.

The second phase just traverses the remaining list entries (without
holding the anon_vma lock), and does any actual freeing of the
anon_vma's that is required.

Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Tested-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: avoid repeated anon_vma lock/unlock sequences in anon_vma_clone()</title>
<updated>2011-06-18T02:20:49+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-06-17T03:44:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bb4aa39676f73b4657b3edd893ae83881c430c0c'/>
<id>bb4aa39676f73b4657b3edd893ae83881c430c0c</id>
<content type='text'>
In anon_vma_clone() we traverse the vma-&gt;anon_vma_chain of the source
vma, locking the anon_vma for each entry.

But they are all going to have the same root entry, which means that
we're locking and unlocking the same lock over and over again.  Which is
expensive in locked operations, but can get _really_ expensive when that
root entry sees any kind of lock contention.

In fact, Tim Chen reports a big performance regression due to this: when
we switched to use a mutex instead of a spinlock, the contention case
gets much worse.

So to alleviate this all, this commit creates a small helper function
(lock_anon_vma_root()) that can be used to take the lock just once
rather than taking and releasing it over and over again.

We still have the same "take the lock and release" it behavior in the
exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
since we're actually freeing the anon_vma entries as we go, and that
will touch the lock too.

Reported-and-tested-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In anon_vma_clone() we traverse the vma-&gt;anon_vma_chain of the source
vma, locking the anon_vma for each entry.

But they are all going to have the same root entry, which means that
we're locking and unlocking the same lock over and over again.  Which is
expensive in locked operations, but can get _really_ expensive when that
root entry sees any kind of lock contention.

In fact, Tim Chen reports a big performance regression due to this: when
we switched to use a mutex instead of a spinlock, the contention case
gets much worse.

So to alleviate this all, this commit creates a small helper function
(lock_anon_vma_root()) that can be used to take the lock just once
rather than taking and releasing it over and over again.

We still have the same "take the lock and release" it behavior in the
exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
since we're actually freeing the anon_vma entries as we go, and that
will touch the lock too.

Reported-and-tested-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>migrate: don't account swapcache as shmem</title>
<updated>2011-06-16T22:01:24+00:00</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2011-06-16T19:56:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=99a15e21d96f6857dafab1e5167e5e8183215c9c'/>
<id>99a15e21d96f6857dafab1e5167e5e8183215c9c</id>
<content type='text'>
swapcache will reach the below code path in migrate_page_move_mapping,
and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
NR_SHMEM.

Hugh pointed out we must use PageSwapCache instead of comparing
mapping to &amp;swapper_space, to avoid build failure with CONFIG_SWAP=n.

Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
swapcache will reach the below code path in migrate_page_move_mapping,
and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
NR_SHMEM.

Hugh pointed out we must use PageSwapCache instead of comparing
mapping to &amp;swapper_space, to avoid build failure with CONFIG_SWAP=n.

Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: get rid of the most spurious find_vma_prev() users</title>
<updated>2011-06-16T07:35:09+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2011-06-16T07:35:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9be34c9d526c305efb332ad53460b57d5f8edb3e'/>
<id>9be34c9d526c305efb332ad53460b57d5f8edb3e</id>
<content type='text'>
We have some users of this function that date back to before the vma
list was doubly linked, and just are silly.  These days, you can find
the previous vma by just following the vma-&gt;vm_prev pointer.

In some cases you don't need any find_vma() lookup at all, and in other
cases you're better off with the regular "find_vma()" that uses the vma
cache front-end lookup.

Some "find_vma_prev()" users are still valid, though.  For example, in
the case of a stack that grows up, it can be the case that we don't find
any 'vma' at all (because we're looking up an address that is past the
last vma), and that the stack that we want to grow is the 'prev' vma.

But that kind of special case aside, we generally should prefer to use
'find_vma()'.

Noticed due to a totally unrelated POWER memory corruption bug that just
happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
using that function here?".

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We have some users of this function that date back to before the vma
list was doubly linked, and just are silly.  These days, you can find
the previous vma by just following the vma-&gt;vm_prev pointer.

In some cases you don't need any find_vma() lookup at all, and in other
cases you're better off with the regular "find_vma()" that uses the vma
cache front-end lookup.

Some "find_vma_prev()" users are still valid, though.  For example, in
the case of a stack that grows up, it can be the case that we don't find
any 'vma' at all (because we're looking up an address that is past the
last vma), and that the stack that we want to grow is the 'prev' vma.

But that kind of special case aside, we generally should prefer to use
'find_vma()'.

Noticed due to a totally unrelated POWER memory corruption bug that just
happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
using that function here?".

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ksm: fix NULL pointer dereference in scan_get_next_rmap_item()</title>
<updated>2011-06-16T03:04:02+00:00</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2011-06-15T22:08:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2b472611a32a72f4a118c069c2d62a1a3f087afd'/>
<id>2b472611a32a72f4a118c069c2d62a1a3f087afd</id>
<content type='text'>
Andrea Righi reported a case where an exiting task can race against
ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
triggering a NULL pointer dereference in ksmd.

ksm_scan.mm_slot == &amp;ksm_mm_head with only one registered mm

CPU 1 (__ksm_exit)		CPU 2 (scan_get_next_rmap_item)
 				list_empty() is false
lock				slot == &amp;ksm_mm_head
list_del(slot-&gt;mm_list)
(list now empty)
unlock
				lock
				slot = list_entry(slot-&gt;mm_list.next)
				(list is empty, so slot is still ksm_mm_head)
				unlock
				slot-&gt;mm == NULL ... Oops

Close this race by revalidating that the new slot is not simply the list
head again.

Andrea's test case:

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/mman.h&gt;

#define BUFSIZE getpagesize()

int main(int argc, char **argv)
{
	void *ptr;

	if (posix_memalign(&amp;ptr, getpagesize(), BUFSIZE) &lt; 0) {
		perror("posix_memalign");
		exit(1);
	}
	if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) &lt; 0) {
		perror("madvise");
		exit(1);
	}
	*(char *)NULL = 0;

	return 0;
}

Reported-by: Andrea Righi &lt;andrea@betterlinux.com&gt;
Tested-by: Andrea Righi &lt;andrea@betterlinux.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Chris Wright &lt;chrisw@sous-sol.org&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Andrea Righi reported a case where an exiting task can race against
ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
triggering a NULL pointer dereference in ksmd.

ksm_scan.mm_slot == &amp;ksm_mm_head with only one registered mm

CPU 1 (__ksm_exit)		CPU 2 (scan_get_next_rmap_item)
 				list_empty() is false
lock				slot == &amp;ksm_mm_head
list_del(slot-&gt;mm_list)
(list now empty)
unlock
				lock
				slot = list_entry(slot-&gt;mm_list.next)
				(list is empty, so slot is still ksm_mm_head)
				unlock
				slot-&gt;mm == NULL ... Oops

Close this race by revalidating that the new slot is not simply the list
head again.

Andrea's test case:

#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;
#include &lt;sys/mman.h&gt;

#define BUFSIZE getpagesize()

int main(int argc, char **argv)
{
	void *ptr;

	if (posix_memalign(&amp;ptr, getpagesize(), BUFSIZE) &lt; 0) {
		perror("posix_memalign");
		exit(1);
	}
	if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) &lt; 0) {
		perror("madvise");
		exit(1);
	}
	*(char *)NULL = 0;

	return 0;
}

Reported-by: Andrea Righi &lt;andrea@betterlinux.com&gt;
Tested-by: Andrea Righi &lt;andrea@betterlinux.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Chris Wright &lt;chrisw@sous-sol.org&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: abort compaction if too many pages are isolated and caller is asynchronous V2</title>
<updated>2011-06-16T03:04:02+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2011-06-15T22:08:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f9e35b3b41f47c4e17d8132edbcab305a6aaa4b0'/>
<id>f9e35b3b41f47c4e17d8132edbcab305a6aaa4b0</id>
<content type='text'>
Asynchronous compaction is used when promoting to huge pages.  This is all
very nice but if there are a number of processes in compacting memory, a
large number of pages can be isolated.  An "asynchronous" process can
stall for long periods of time as a result with a user reporting that
firefox can stall for 10s of seconds.  This patch aborts asynchronous
compaction if too many pages are isolated as it's better to fail a
hugepage promotion than stall a process.

[minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
Reported-and-tested-by: Ury Stankevich &lt;urykhy@gmail.com&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Asynchronous compaction is used when promoting to huge pages.  This is all
very nice but if there are a number of processes in compacting memory, a
large number of pages can be isolated.  An "asynchronous" process can
stall for long periods of time as a result with a user reporting that
firefox can stall for 10s of seconds.  This patch aborts asynchronous
compaction if too many pages are isolated as it's better to fail a
hugepage promotion than stall a process.

[minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
Reported-and-tested-by: Ury Stankevich &lt;urykhy@gmail.com&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: vmscan: do not use page_count without a page pin</title>
<updated>2011-06-16T03:04:02+00:00</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2011-06-15T22:08:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d179e84ba5da1d0024087d1759a2938817a00f3f'/>
<id>d179e84ba5da1d0024087d1759a2938817a00f3f</id>
<content type='text'>
It is unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading
page-&gt;first_page if the compound page is being freed by another CPU.

[mgorman@suse.de: split out patch]
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It is unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading
page-&gt;first_page if the compound page is being freed by another CPU.

[mgorman@suse.de: split out patch]
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;

Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: compaction: ensure that the compaction free scanner does not move to the next zone</title>
<updated>2011-06-16T03:04:02+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2011-06-15T22:08:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7454f4ba40b419eb999a3c61a99da662bf1a2bb8'/>
<id>7454f4ba40b419eb999a3c61a99da662bf1a2bb8</id>
<content type='text'>
Compaction works with two scanners, a migration and a free scanner.  When
the scanners crossover, migration within the zone is complete.  The
location of the scanner is recorded on each cycle to avoid excesive
scanning.

When a zone is small and mostly reserved, it's very easy for the migration
scanner to be close to the end of the zone.  Then the following situation
can occurs

  o migration scanner isolates some pages near the end of the zone
  o free scanner starts at the end of the zone but finds that the
    migration scanner is already there
  o free scanner gets reinitialised for the next cycle as
    cc-&gt;migrate_pfn + pageblock_nr_pages
    moving the free scanner into the next zone
  o migration scanner moves into the next zone

When this happens, NR_ISOLATED accounting goes haywire because some of the
accounting happens against the wrong zone.  One zones counter remains
positive while the other goes negative even though the overall global
count is accurate.  This was reported on X86-32 with !SMP because !SMP
allows the negative counters to be visible.  The fact that it is the bug
should theoritically be possible there.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Compaction works with two scanners, a migration and a free scanner.  When
the scanners crossover, migration within the zone is complete.  The
location of the scanner is recorded on each cycle to avoid excesive
scanning.

When a zone is small and mostly reserved, it's very easy for the migration
scanner to be close to the end of the zone.  Then the following situation
can occurs

  o migration scanner isolates some pages near the end of the zone
  o free scanner starts at the end of the zone but finds that the
    migration scanner is already there
  o free scanner gets reinitialised for the next cycle as
    cc-&gt;migrate_pfn + pageblock_nr_pages
    moving the free scanner into the next zone
  o migration scanner moves into the next zone

When this happens, NR_ISOLATED accounting goes haywire because some of the
accounting happens against the wrong zone.  One zones counter remains
positive while the other goes negative even though the overall global
count is accurate.  This was reported on X86-32 with !SMP because !SMP
allows the negative counters to be visible.  The fact that it is the bug
should theoritically be possible there.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>compaction: checks correct fragmentation index</title>
<updated>2011-06-16T03:04:02+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shaohua.li@intel.com</email>
</author>
<published>2011-06-15T22:08:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a582a738c763e106f47eab24b8146c698a9c700b'/>
<id>a582a738c763e106f47eab24b8146c698a9c700b</id>
<content type='text'>
fragmentation_index() returns -1000 when the allocation might succeed
This doesn't match the comment and code in compaction_suitable(). I
thought compaction_suitable should return COMPACT_PARTIAL in -1000
case, because in this case allocation could succeed depending on
watermarks.

The impact of this is that compaction starts and compact_finished() is
called which rechecks the watermarks and the free lists.  It should have
the same result in that compaction should not start but is more expensive.

Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
fragmentation_index() returns -1000 when the allocation might succeed
This doesn't match the comment and code in compaction_suitable(). I
thought compaction_suitable should return COMPACT_PARTIAL in -1000
case, because in this case allocation could succeed depending on
watermarks.

The impact of this is that compaction starts and compact_finished() is
called which rechecks the watermarks and the free lists.  It should have
the same result in that compaction should not start but is more expensive.

Acked-by: Mel Gorman &lt;mel@csn.ul.ie&gt;
Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
