linux.git/mm/page_alloc.c, branch v3.7-rc7

fix incorrect NR_FREE_PAGES accounting (appears like memory leak)

2012-11-21T22:33:16+00:00

There have been some 3.7-rc reports of vm issues, including some kswapd
bugs and, more importantly, some memory "leaks":

	http://www.spinics.net/lists/linux-mm/msg46187.html
	https://bugzilla.kernel.org/show_bug.cgi?id=50181

Commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page
immediately when it is made available") took split_free_page() and
reused it for the compaction code.  It does something curious with
capture_free_page() (previously known as split_free_page()):

  int capture_free_page(struct page *page, int alloc_order,
  ...
          __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));

  -       /* Split into individual pages */
  -       set_page_refcounted(page);
  -       split_page(page, order);
  +       if (alloc_order != order)
  +               expand(zone, page, alloc_order, order,
  +                       &zone->free_area[order], migratetype);

Note that expand() puts the pages _back_ in the allocator, but it does
not bump NR_FREE_PAGES.  We "return" 'alloc_order' worth of pages, but
we accounted for removing 'order' in the __mod_zone_page_state() call.

For the old split_page()-style use (order==alloc_order) the bug will not
trigger.  But, when called from the compaction code where we
occasionally get a larger page out of the buddy allocator than we need,
we will run in to this.

This patch simply changes the NR_FREE_PAGES manipulation to the correct
'alloc_order' instead of 'order'.

I've been able to repeatedly trigger this in my testing environment.
The amount "leaked" very closely tracks the imbalance I see in buddy
pages vs.  NR_FREE_PAGES.  I have confirmed that this patch fixes the
imbalance

Signed-off-by: Dave Hansen 
Acked-by: Mel Gorman 
Signed-off-by: Linus Torvalds

revert "mm: fix-up zone present pages"

2012-11-16T22:33:04+00:00

Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages")

That patch tried to fix a issue when calculating zone->present_pages,
but it caused a regression on 32bit systems with HIGHMEM.  With that
change, reset_zone_present_pages() resets all zone->present_pages to
zero, and fixup_zone_present_pages() is called to recalculate
zone->present_pages when the boot allocator frees core memory pages into
buddy allocator.  Because highmem pages are not freed by bootmem
allocator, all highmem zones' present_pages becomes zero.

Various options for improving the situation are being discussed but for
now, let's return to the 3.6 code.

Cc: Jianguo Wu 
Cc: Jiang Liu 
Cc: Petr Tesarik 
Cc: "Luck, Tony" 
Cc: Mel Gorman 
Cc: Yinghai Lu 
Cc: Minchan Kim 
Cc: Johannes Weiner 
Acked-by: David Rientjes 
Tested-by: Chris Clayton 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: fix hotplugged memory zone oops

2012-11-16T22:33:04+00:00

When MEMCG is configured on (even when it's disabled by boot option),
when adding or removing a page to/from its lru list, the zone pointer
used for stats updates is nowadays taken from the struct lruvec.  (On
many configurations, calculating zone from page is slower.)

But we have no code to update all the lruvecs (per zone, per memcg) when
a memory node is hotadded.  Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
  IP:  __mod_zone_page_state+0x9/0x60
  Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
  Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
  Call Trace:
    __pagevec_lru_add_fn+0xdf/0x140
    pagevec_lru_move_fn+0xb1/0x100
    __pagevec_lru_add+0x1c/0x30
    lru_add_drain_cpu+0xa3/0x130
    lru_add_drain+0x2f/0x40
   ...

The natural solution might be to use a memcg callback whenever memory is
hotadded; but that solution has not been scoped out, and it happens that
we do have an easy location at which to update lruvec->zone.  The lruvec
pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.

So check and set lruvec->zone in those; and remove the inadequate
attempt to set lruvec->zone from lruvec_init(), which is called before
NODE_DATA(node) has been allocated in such cases.

Ah, there was one exceptionr.  For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too.  In fact it was already safe against such
an oops (the lru lists in danger could only be empty), but we're better
proofed against future changes this way.

I've marked this for stable (3.6) since we introduced the problem in 3.5
(now closed to stable); but I have no idea if this is the only fix
needed to get memory hotadd working with memcg in 3.6, and received no
answer when I enquired twice before.

Reported-by: Tang Chen 
Signed-off-by: Hugh Dickins 
Acked-by: Johannes Weiner 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Konstantin Khlebnikov 
Cc: Wen Congyang 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant

2012-10-25T21:37:53+00:00

Commit 957f822a0ab9 ("mm, numa: reclaim from all nodes within reclaim
distance") caused zone_reclaim_mode to be set for all systems where two
nodes are within RECLAIM_DISTANCE of each other.  This is the opposite
of what we actually want: zone_reclaim_mode should be set if two nodes
are sufficiently distant.

Signed-off-by: David Rientjes 
Reported-by: Julian Wollrath 
Tested-by: Julian Wollrath 
Cc: Hugh Dickins 
Cc: Patrik Kullman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/page_alloc.c:alloc_contig_range(): return early for err path

2012-10-25T21:37:52+00:00

If start_isolate_page_range() failed, unset_migratetype_isolate() has been
done inside it.

Signed-off-by: Bob Liu 
Cc: Ni zhan Chen 
Cc: Marek Szyprowski 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

cma: decrease cc.nr_migratepages after reclaiming pagelist

2012-10-09T07:23:01+00:00

reclaim_clean_pages_from_list() reclaims clean pages before migration so
cc.nr_migratepages should be updated.  Currently, there is no problem but
it can be wrong if we try to use the value in future.

Signed-off-by: Minchan Kim 
Acked-by: Mel Gorman 
Cc: Michal Nazarewicz 
Cc: Bartlomiej Zolnierkiewicz 
Cc: Marek Szyprowski 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

CMA: migrate mlocked pages

2012-10-09T07:23:00+00:00

Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.

This patch makes mlocked pages be migrated out.  Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running.  If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.

[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim 
Acked-by: Mel Gorman 
Cc: Michal Nazarewicz 
Cc: Bartlomiej Zolnierkiewicz 
Cc: Marek Szyprowski 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memory-hotplug: fix zone stat mismatch

2012-10-09T07:22:59+00:00

During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
causing the kernel to hang.  When the system doesn't have enough free
pages, it enters reclaim but never reclaim any pages due to
too_many_isolated()==true and loops forever.

The cause is that when we do memory-hotadd after memory-remove,
__zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
although the vm_stat_diff of all CPUs still have values.

In addtion, when we offline all pages of the zone, we reset them in
zone_pcp_reset without draining so we loss some zone stat item.

Reviewed-by: Wen Congyang 
Signed-off-by: Minchan Kim 
Cc: Kamezawa Hiroyuki 
Cc: Yasuaki Ishimatsu 
Cc: KOSAKI Motohiro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, numa: reclaim from all nodes within reclaim distance

2012-10-09T07:22:56+00:00

RECLAIM_DISTANCE represents the distance between nodes at which it is
deemed too costly to allocate from; it's preferred to try to reclaim from
a local zone before falling back to allocating on a remote node with such
a distance.

To do this, zone_reclaim_mode is set if the distance between any two
nodes on the system is greather than this distance.  This, however, ends
up causing the page allocator to reclaim from every zone regardless of
its affinity.

What we really want is to reclaim only from zones that are closer than
RECLAIM_DISTANCE.  This patch adds a nodemask to each node that
represents the set of nodes that are within this distance.  During the
zone iteration, if the bit for a zone's node is set for the local node,
then reclaim is attempted; otherwise, the zone is skipped.

[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Signed-off-by: David Rientjes 
Cc: Mel Gorman 
Cc: Minchan Kim 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove free_page_mlock

2012-10-09T07:22:56+00:00

We should not be seeing non-0 unevictable_pgs_mlockfreed any longer.  So
remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
checking it, reporting "BUG: Bad page state" if it's ever found set.
Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.

Signed-off-by: Hugh Dickins 
Acked-by: Mel Gorman 
Cc: Rik van Riel 
Cc: Johannes Weiner 
Cc: Michel Lespinasse 
Cc: Ying Han 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds