linux-stable.git/mm, branch linux-3.3.y

mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages

2012-06-01T07:15:46+00:00

commit 05f144a0d5c2207a0349348127f996e104ad7404 upstream.

Dave Jones' system call fuzz testing tool "trinity" triggered the
following bug error with slab debugging enabled

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
     __slab_alloc+0x3d3/0x445
     kmem_cache_alloc+0x29d/0x2b0
     mpol_new+0xa3/0x140
     sys_mbind+0x142/0x620
     system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
     __slab_free+0x2e/0x1de
     kmem_cache_free+0x25a/0x260
     __mpol_put+0x27/0x30
     remove_vma+0x68/0x90
     exit_mmap+0x118/0x140
     mmput+0x73/0x110
     exit_mm+0x108/0x130
     do_exit+0x162/0xb90
     do_group_exit+0x4f/0xc0
     sys_exit_group+0x17/0x20
     system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x          (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

This implied a reference counting bug and the problem happened during
mbind().

mbind() applies a new memory policy to a range and uses mbind_range() to
merge existing VMAs or split them as necessary.  In the event of splits,
mpol_dup() will allocate a new struct mempolicy and maintain existing
reference counts whose rules are documented in
Documentation/vm/numa_memory_policy.txt .

The problem occurs with shared memory policies.  The vm_op->set_policy
increments the reference count if necessary and split_vma() and
vma_merge() have already handled the existing reference counts.
However, policy_vma() screws it up by replacing an existing
vma->vm_policy with one that potentially has the wrong reference count
leading to a premature free.  This patch removes the damage caused by
policy_vma().

With this patch applied Dave's trinity tool runs an mbind test for 5
minutes without error.  /proc/slabinfo reported that there are no
numa_policy or shared_policy_node objects allocated after the test
completed and the shared memory region was deleted.

Signed-off-by: Mel Gorman 
Cc: Dave Jones 
Cc: KOSAKI Motohiro 
Cc: Stephen Wilson 
Cc: Christoph Lameter 
Cc: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

swap: don't do discard if no discard option added

2012-06-01T07:15:45+00:00

commit 052b1987faca3606109d88d96bce124851f7c4c2 upstream.

When swapon() was not passed the SWAP_FLAG_DISCARD option, sys_swapon()
will still perform a discard operation.  This can cause problems if
discard is slow or buggy.

Reverse the order of the check so that a discard operation is performed
only if the sys_swapon() caller is attempting to enable discard.

Signed-off-by: Shaohua Li 
Reported-by: Holger Kiehl 
Tested-by: Holger Kiehl 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Konrad Rzeszutek Wilk 
Cc: William Dauchy 
Signed-off-by: Greg Kroah-Hartman

memcg: free spare array to avoid memory leak

2012-05-21T17:46:22+00:00

commit 8c7577637ca31385e92769a77e2ab5b428e8b99c upstream.

When the last event is unregistered, there is no need to keep the spare
array anymore.  So free it to avoid memory leak.

Signed-off-by: Sha Zhengju 
Acked-by: KAMEZAWA Hiroyuki 
Reviewed-by: Kirill A. Shutemov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: nobootmem: fix sign extend problem in __free_pages_memory()

2012-05-21T17:46:20+00:00

commit 6bc2e853c6b46a6041980d58200ad9b0a73a60ff upstream.

Systems with 8 TBytes of memory or greater can hit a problem where only
the the first 8 TB of memory shows up.  This is due to "int i" being
smaller than "unsigned long start_aligned", causing the high bits to be
dropped.

The fix is to change `i' to unsigned long to match start_aligned
and end_aligned.

Thanks to Jack Steiner for assistance tracking this down.

Signed-off-by: Russ Anderson 
Cc: Jack Steiner 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Cc: David S. Miller 
Cc: Yinghai Lu 
Cc: Gavin Shan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()

2012-05-21T17:46:20+00:00

commit 4998a6c0edce7fae9c0a5463f6ec3fa585258ee7 upstream.

Commit 66aebce747eaf ("hugetlb: fix race condition in hugetlb_fault()")
added code to avoid a race condition by elevating the page refcount in
hugetlb_fault() while calling hugetlb_cow().

However, one code path in hugetlb_cow() includes an assertion that the
page count is 1, whereas it may now also have the value 2 in this path.

The consensus is that this BUG_ON has served its purpose, so rather than
extending it to cover both cases, we just remove it.

Signed-off-by: Chris Metcalf 
Acked-by: Mel Gorman 
Acked-by: Hillf Danton 
Acked-by: Hugh Dickins 
Cc: Michal Hocko 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

kmemleak: Fix the kmemleak tracking of the percpu areas with !SMP

2012-05-21T17:46:19+00:00

commit 100d13c3b5b9410f604b86f5e0a34da64b8cf659 upstream.

Kmemleak tracks the percpu allocations via a specific API and the
originally allocated areas must be removed from kmemleak (via
kmemleak_free). The code was already doing this for SMP systems.

Reported-by: Sami Liedes 
Cc: Tejun Heo 
Cc: Christoph Lameter 
Signed-off-by: Catalin Marinas 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

percpu: pcpu_embed_first_chunk() should free unused parts after all allocs are complete

2012-05-21T17:46:19+00:00

commit 42b64281453249dac52861f9b97d18552a7ec62b upstream.

pcpu_embed_first_chunk() allocates memory for each node, copies percpu
data and frees unused portions of it before proceeding to the next
group.  This assumes that allocations for different nodes doesn't
overlap; however, depending on memory topology, the bootmem allocator
may end up allocating memory from a different node than the requested
one which may overlap with the portion freed from one of the previous
percpu areas.  This leads to percpu groups for different nodes
overlapping which is a serious bug.

This patch separates out copy & partial free from the allocation loop
such that all allocations are complete before partial frees happen.

This also fixes overlapping frees which could happen on allocation
failure path - out_free_areas path frees whole groups but the groups
could have portions freed at that point.

Signed-off-by: Tejun Heo 
Reported-by: "Pavel V. Panteleev" 
Tested-by: "Pavel V. Panteleev" 
LKML-Reference: 
Signed-off-by: Greg Kroah-Hartman

hugepages: fix use after free bug in "quota" handling

2012-05-12T16:32:21+00:00

commit 90481622d75715bfcb68501280a917dbfe516029 upstream.

hugetlbfs_{get,put}_quota() are badly named.  They don't interact with the
general quota handling code, and they don't much resemble its behaviour.
Rather than being about maintaining limits on on-disk block usage by
particular users, they are instead about maintaining limits on in-memory
page usage (including anonymous MAP_PRIVATE copied-on-write pages)
associated with a particular hugetlbfs filesystem instance.

Worse, they work by having callbacks to the hugetlbfs filesystem code from
the low-level page handling code, in particular from free_huge_page().
This is a layering violation of itself, but more importantly, if the
kernel does a get_user_pages() on hugepages (which can happen from KVM
amongst others), then the free_huge_page() can be delayed until after the
associated inode has already been freed.  If an unmount occurs at the
wrong time, even the hugetlbfs superblock where the "quota" limits are
stored may have been freed.

Andrew Barry proposed a patch to fix this by having hugepages, instead of
storing a pointer to their address_space and reaching the superblock from
there, had the hugepages store pointers directly to the superblock,
bumping the reference count as appropriate to avoid it being freed.
Andrew Morton rejected that version, however, on the grounds that it made
the existing layering violation worse.

This is a reworked version of Andrew's patch, which removes the extra, and
some of the existing, layering violation.  It works by introducing the
concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
finite logical pool of hugepages to allocate from.  hugetlbfs now creates
a subpool for each filesystem instance with a page limit set, and a
pointer to the subpool gets added to each allocated hugepage, instead of
the address_space pointer used now.  The subpool has its own lifetime and
is only freed once all pages in it _and_ all other references to it (i.e.
superblocks) are gone.

subpools are optional - a NULL subpool pointer is taken by the code to
mean that no subpool limits are in effect.

Previous discussion of this bug found in:  "Fix refcounting in hugetlbfs
quota handling.". See:  https://lkml.org/lkml/2011/8/11/28 or
http://marc.info/?l=linux-mm&m=126928970510627&w=1

v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
alloc_huge_page() - since it already takes the vma, it is not necessary.

Signed-off-by: Andrew Barry 
Signed-off-by: David Gibson 
Cc: Hugh Dickins 
Cc: Mel Gorman 
Cc: Minchan Kim 
Cc: Hillf Danton 
Cc: Paul Mackerras 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: fix s390 BUG by __set_page_dirty_no_writeback on swap

2012-04-27T17:16:49+00:00

commit aca50bd3b4c4bb5528a1878158ba7abce41de534 upstream.

Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
tries to transfer dirty flag from s390 storage key to struct page and
radix_tree.

That would be because of reclaim's shrink_page_list() calling
add_to_swap() on this page at the same time: first PageSwapCache is set
(causing page_mapping(page) to appear as &swapper_space), then
page->private set, then tree_lock taken, then page inserted into
radix_tree - so there's an interval before taking the lock when the
radix_tree slot is empty.

We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
before the SetPageSwapCache.  But a better fix is simply to do what's
five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
(if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
overhead, and swap is just the same - it ignores the radix_tree tag, and
does not participate in dirty page accounting, so should be using
__set_page_dirty_no_writeback() too.

s390 testing now confirms that this does indeed fix the problem.

Reported-by: Mel Gorman 
Signed-off-by: Hugh Dickins 
Acked-by: Mel Gorman 
Cc: Andrew Morton 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Rik van Riel 
Cc: Ken Chen 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

memblock: memblock should be able to handle zero length operations

2012-04-27T17:16:34+00:00

commit b3dc627cabb33fc95f93da78457770c1b2a364d2 upstream.

Commit 24aa07882b ("memblock, x86: Replace memblock_x86_reserve/
free_range() with generic ones") replaced x86 specific memblock
operations with the generic ones; unfortunately, it lost zero length
operation handling in the process making the kernel panic if somebody
tries to reserve zero length area.

There isn't much to be gained by being cranky to zero length operations
and panicking is almost the worst response.  Drop the BUG_ON() in
memblock_reserve() and update memblock_add_region/isolate_range() so
that all zero length operations are handled as noops.

Signed-off-by: Tejun Heo 
Reported-by: Valere Monseur 
Bisected-by: Joseph Freeman 
Tested-by: Joseph Freeman 
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=43098
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman