linux-stable.git/mm, branch v3.1.7

percpu: fix chunk range calculation

2011-12-21T20:58:30+00:00

commit a855b84c3d8c73220d4d3cd392a7bee7c83de70e upstream.

Percpu allocator recorded the cpus which map to the first and last
units in pcpu_first/last_unit_cpu respectively and used them to
determine the address range of a chunk - e.g. it assumed that the
first unit has the lowest address in a chunk while the last unit has
the highest address.

This simply isn't true.  Groups in a chunk can have arbitrary positive
or negative offsets from the previous one and there is no guarantee
that the first unit occupies the lowest offset while the last one the
highest.

Fix it by actually comparing unit offsets to determine cpus occupying
the lowest and highest offsets.  Also, rename pcu_first/last_unit_cpu
to pcpu_low/high_unit_cpu to avoid confusion.

The chunk address range is used to flush cache on vmalloc area
map/unmap and decide whether a given address is in the first chunk by
per_cpu_ptr_to_phys() and the bug was discovered by invalid
per_cpu_ptr_to_phys() translation for crash_note.

Kudos to Dave Young for tracking down the problem.

Signed-off-by: Tejun Heo 
Reported-by: WANG Cong 
Reported-by: Dave Young 
Tested-by: Dave Young 
LKML-Reference: <4EC21F67.10905@redhat.com>
Signed-off-by: Thomas Renninger 
Signed-off-by: Greg Kroah-Hartman

mm: vmalloc: check for page allocation failure before vmlist insertion

2011-12-21T20:58:24+00:00

commit 1368edf0647ac112d8cfa6ce47257dc950c50f5c upstream.

Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
it is fully initialised.  Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area.  In the event of
allocation failure, the vmalloc area is freed but the pointer to freed
memory is inserted into the vmlist leading to a a crash later in
get_vmalloc_info().

This patch adds a check for ____vmalloc_area_node() failure within
__vmalloc_node_range.  It does not use "goto fail" as in the previous
error path as a warning was already displayed by __vmalloc_area_node()
before it called vfree in its failure path.

Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.

Signed-off-by: Mel Gorman 
Reported-by: Luciano Chavez 
Tested-by: Luciano Chavez 
Reviewed-by: Rik van Riel 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks

2011-12-21T20:58:23+00:00

commit d021563888312018ca65681096f62e36c20e63cc upstream.

setup_zone_migrate_reserve() expects that zone->start_pfn starts at
pageblock_nr_pages aligned pfn otherwise we could access beyond an
existing memblock resulting in the following panic if
CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

  IP: [] setup_zone_migrate_reserve+0xcd/0x180
  *pdpt = 0000000000000000 *pde = f000ff53f000ff53
  Oops: 0000 [#1] SMP
  Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
  EIP: 0060:[] EFLAGS: 00010006 CPU: 0
  EIP is at setup_zone_migrate_reserve+0xcd/0x180
  EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
  ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
  Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
  Call Trace:
  [] __setup_per_zone_wmarks+0xec/0x160
  [] setup_per_zone_wmarks+0xf/0x20
  [] init_per_zone_wmark_min+0x27/0x86
  [] do_one_initcall+0x2b/0x160
  [] kernel_init+0xbe/0x157
  [] kernel_thread_helper+0x6/0xd
  Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
  EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
  CR2: 00000000000012b4

We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
highstart_pfn = 0x36ffe.

The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
in a pageblock is reserved before marking it MIGRATE_RESERVE").

Make sure that start_pfn is always aligned to pageblock_nr_pages to
ensure that pfn_valid s always called at the start of each pageblock.
Architectures with holes in pageblocks will be correctly handled by
pfn_valid_within in pageblock_is_reserved.

Signed-off-by: Michal Hocko 
Signed-off-by: Mel Gorman 
Tested-by: Dang Bo 
Reviewed-by: KAMEZAWA Hiroyuki 
Cc: Andrea Arcangeli 
Cc: David Rientjes 
Cc: Arve Hjnnevg 
Cc: KOSAKI Motohiro 
Cc: John Stultz 
Cc: Dave Hansen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

thp: set compound tail page _count to zero

2011-12-21T20:58:22+00:00

commit 58a84aa92723d1ac3e1cc4e3b0ff49291663f7e1 upstream.

Commit 70b50f94f1644 ("mm: thp: tail page refcounting fix") keeps all
page_tail->_count zero at all times.  But the current kernel does not
set page_tail->_count to zero if a 1GB page is utilized.  So when an
IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
tail page's _count does not equal zero.

  kernel BUG at include/linux/mm.h:386!
  invalid opcode: 0000 [#1] SMP
  Call Trace:
    gup_pud_range+0xb8/0x19d
    get_user_pages_fast+0xcb/0x192
    ? trace_hardirqs_off+0xd/0xf
    hva_to_pfn+0x119/0x2f2
    gfn_to_pfn_memslot+0x2c/0x2e
    kvm_iommu_map_pages+0xfd/0x1c1
    kvm_iommu_map_memslots+0x7c/0xbd
    kvm_iommu_map_guest+0xaa/0xbf
    kvm_vm_ioctl_assigned_device+0x2ef/0xa47
    kvm_vm_ioctl+0x36c/0x3a2
    do_vfs_ioctl+0x49e/0x4e4
    sys_ioctl+0x5a/0x7c
    system_call_fastpath+0x16/0x1b
  RIP  gup_huge_pud+0xf2/0x159

Signed-off-by: Youquan Song 
Reviewed-by: Andrea Arcangeli 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

slab, lockdep: Fix silly bug

2011-12-09T16:55:48+00:00

commit 52cef189165d74a5d6030184a8e05595194c69ca upstream.

Commit 30765b92 ("slab, lockdep: Annotate the locks before using
them") moves the init_lock_keys() call from after g_cpucache_up =
FULL, to before it. And overlooks the fact that init_node_lock_keys()
tests for it and ignores everything !FULL.

Introduce a LATE stage and change the lockdep test to be 
Cc: Pekka Enberg 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

hugetlb: release pages in the error path of hugetlb_cow()

2011-12-09T16:54:31+00:00

commit ea4039a34c4c206d015d34a49d0b00868e37db1d upstream.

If we fail to prepare an anon_vma, the {new, old}_page should be released,
or they will leak.

Signed-off-by: Hillf Danton 
Reviewed-by: Andrea Arcangeli 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Michal Hocko 
Signed-off-by: Greg Kroah-Hartman

backing-dev: ensure wakeup_timer is deleted

2011-11-21T22:35:28+00:00

commit 7a401a972df8e184b3d1a3fc958c0a4ddee8d312 upstream.

bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links
from all super_blocks and then del_timer_sync() the writeback timer.

However, this can race with __mark_inode_dirty(), leading to
bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
we're unregistering, after we've called del_timer_sync().

This can end up with the bdi being freed with an active timer inside it,
as in the case of the following dump after the removal of an SD card.

Fix this by redoing the del_timer_sync() in bdi_destory().

 ------------[ cut here ]------------
 WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
 ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
 Modules linked in:
 Backtrace:
 [] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
  r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
 [] (dump_stack+0x0/0x1c) from [] (warn_slowpath_common+0x54/0x6c)
 [] (warn_slowpath_common+0x0/0x6c) from [] (warn_slowpath_fmt+0x38/0x40)
  r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
 r3:00000009
 [] (warn_slowpath_fmt+0x0/0x40) from [] (debug_print_object+0x9c/0xc8)
  r3:c02b1b29 r2:c02bc662
 [] (debug_print_object+0x0/0xc8) from [] (debug_check_no_obj_freed+0xac/0x1dc)
  r6:c7964000 r5:00000001 r4:c7964000
 [] (debug_check_no_obj_freed+0x0/0x1dc) from [] (kmem_cache_free+0x88/0x1f8)
 [] (kmem_cache_free+0x0/0x1f8) from [] (blk_release_queue+0x70/0x78)
 [] (blk_release_queue+0x0/0x78) from [] (kobject_release+0x70/0x84)
  r5:c79641f0 r4:c796420c
 [] (kobject_release+0x0/0x84) from [] (kref_put+0x68/0x80)
  r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
 [] (kref_put+0x0/0x80) from [] (kobject_put+0x48/0x5c)
  r5:c79643b4 r4:c79641f0
 [] (kobject_put+0x0/0x5c) from [] (blk_cleanup_queue+0x68/0x74)
  r4:c7964000
 [] (blk_cleanup_queue+0x0/0x74) from [] (mmc_blk_put+0x78/0xe8)
  r5:00000000 r4:c794c400
 [] (mmc_blk_put+0x0/0xe8) from [] (mmc_blk_release+0x24/0x38)
  r5:c794c400 r4:c0322824
 [] (mmc_blk_release+0x0/0x38) from [] (__blkdev_put+0xe8/0x170)
  r5:c78d5e00 r4:c74083c0
 [] (__blkdev_put+0x0/0x170) from [] (blkdev_put+0x11c/0x12c)
  r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
 r3:00000000
 [] (blkdev_put+0x0/0x12c) from [] (kill_block_super+0x60/0x6c)
  r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
 [] (kill_block_super+0x0/0x6c) from [] (deactivate_locked_super+0x44/0x70)
  r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
 [] (deactivate_locked_super+0x0/0x70) from [] (deactivate_super+0x6c/0x70)
  r5:c794dc00 r4:c794dc00
 [] (deactivate_super+0x0/0x70) from [] (mntput_no_expire+0x188/0x194)
  r5:c794dc00 r4:c7942300
 [] (mntput_no_expire+0x0/0x194) from [] (sys_umount+0x2e4/0x310)
  r6:c7942300 r5:00000000 r4:00000000 r3:00000000
 [] (sys_umount+0x0/0x310) from [] (ret_fast_syscall+0x0/0x30)
 ---[ end trace e5c83c92ada51c76 ]---

Signed-off-by: Rabin Vincent 
Signed-off-by: Linus Walleij 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

mm: thp: tail page refcounting fix

2011-11-11T17:43:41+00:00

commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725 upstream.

Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at
all times.  This will guarantee that get_page_unless_zero() can never
succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages.  That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic.  As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns.  It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).

The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages.  The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it.  A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead.  So it's worth it.  Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages().  The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.

Signed-off-by: Andrea Arcangeli 
Reported-by: Michel Lespinasse 
Reviewed-by: Michel Lespinasse 
Reviewed-by: Minchan Kim 
Cc: Peter Zijlstra 
Cc: Hugh Dickins 
Cc: Johannes Weiner 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: KOSAKI Motohiro 
Cc: Benjamin Herrenschmidt 
Cc: David Gibson 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: avoid null pointer access in vm_struct via /proc/vmallocinfo

2011-11-11T17:43:33+00:00

commit f5252e009d5b87071a919221e4f6624184005368 upstream.

The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct.  It, however, may access pages
field of vm_struct where a page was not allocated.  This results in a null
pointer access and leads to a kernel panic.

Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node().  In other words, it is added to vmlist before it is
fully initialized.  At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info().  Thus, a null pointer access happens.

The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized.  So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.

Signed-off-by: Mitsuo Hayasaka 
Cc: Andrew Morton 
Cc: David Rientjes 
Cc: Namhyung Kim 
Cc: "Paul E. McKenney" 
Cc: Jeremy Fitzhardinge 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: fix race between mremap and removing migration entry

2011-10-20T06:42:58+00:00

I don't usually pay much attention to the stale "? " addresses in
stack backtraces, but this lucky report from Pawel Sikora hints that
mremap's move_ptes() has inadequate locking against page migration.

 3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
 kernel BUG at include/linux/swapops.h:105!
 RIP: 0010:[]  []
                       migration_entry_wait+0x156/0x160
  [] handle_pte_fault+0xae1/0xaf0
  [] ? __pte_alloc+0x42/0x120
  [] ? do_huge_pmd_anonymous_page+0xab/0x310
  [] handle_mm_fault+0x181/0x310
  [] ? vma_adjust+0x537/0x570
  [] do_page_fault+0x11d/0x4e0
  [] ? do_mremap+0x2d5/0x570
  [] page_fault+0x1f/0x30

mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
and pagetable locks, were good enough before page migration (with its
requirement that every migration entry be found) came in, and enough
while migration always held mmap_sem; but not enough nowadays, when
there's memory hotremove and compaction.

The danger is that move_ptes() lets a migration entry dodge around
behind remove_migration_pte()'s back, so it's in the old location when
looking at the new, then in the new location when looking at the old.

Either mremap's move_ptes() must additionally take anon_vma lock(), or
migration's remove_migration_pte() must stop peeking for is_swap_entry()
before it takes pagetable lock.

Consensus chooses the latter: we prefer to add overhead to migration
than to mremapping, which gets used by JVMs and by exec stack setup.

Reported-and-tested-by: Paweł Sikora 
Signed-off-by: Hugh Dickins 
Acked-by: Andrea Arcangeli 
Acked-by: Mel Gorman 
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds