linux.git/mm, branch v3.10

Merge branch 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux

2013-06-18T16:27:47+00:00

Pull SLAB fix from Pekka Enberg:
 "A slab regression fix by Sasha Levin"

* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
  slab: prevent warnings when allocating with __GFP_NOWARN

slab: prevent warnings when allocating with __GFP_NOWARN

2013-06-13T07:01:58+00:00

Sasha Levin noticed that the warning introduced by commit 6286ae9
("slab: Return NULL for oversized allocations") is being triggered:

  WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
  can: request_module (can-proto-4) failed.
  mpoa: proc_mpc_write: could not parse ''
  Modules linked in:
  CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W    3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
   0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
   ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
   0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
  Call Trace:
   [] dump_stack+0x4e/0x82
   [] warn_slowpath_common+0x82/0xb0
   [] warn_slowpath_null+0x15/0x20
   [] kmalloc_slab+0x2f/0xb0
   [] __kmalloc+0x24/0x4b0
   [] ? security_capable+0x13/0x20
   [] ? pipe_fcntl+0x107/0x210
   [] pipe_fcntl+0x107/0x210
   [] ? fget_raw_light+0x130/0x3f0
   [] SyS_fcntl+0x60b/0x6a0
   [] tracesys+0xe1/0xe6

Andrew Morton writes:

  __GFP_NOWARN is frequently used by kernel code to probe for "how big
  an allocation can I get".  That's a bit lame, but it's used on slow
  paths and is pretty simple.

However, SLAB will still warn about an oversized allocation when the
__GFP_NOWARN flag is _not_ set, so that kernel bugs are exposed.
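
The intended behaviour can be sketched in userspace C. This is only an
illustration, not the kernel code: the SKETCH_* constants, the warning
counter and sketch_size_ok() are hypothetical names with made-up values.
An oversized request is always refused, but the warning is printed only
when the caller did not ask for silence.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are not the kernel's real values. */
#define SKETCH_GFP_NOWARN   0x200u
#define SKETCH_KMALLOC_MAX  (4u * 1024u * 1024u)

static unsigned int sketch_warnings;	/* number of warnings emitted so far */

/*
 * Return true if a request of `size` bytes is small enough to be served,
 * false if it is oversized.  An oversized request warns only when the
 * caller did not pass SKETCH_GFP_NOWARN.
 */
static bool sketch_size_ok(size_t size, unsigned int flags)
{
	if (size > SKETCH_KMALLOC_MAX) {
		if (!(flags & SKETCH_GFP_NOWARN)) {
			sketch_warnings++;
			fprintf(stderr, "warning: oversized allocation (%zu bytes)\n",
				size);
		}
		return false;
	}
	return true;
}
```

A caller probing for the largest allocation it can get would pass
SKETCH_GFP_NOWARN and simply retry with a smaller size on failure.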

Signed-off-by: Sasha Levin 
[ penberg@kernel.org: improve changelog ]
Signed-off-by: Pekka Enberg

mm: memcontrol: fix lockless reclaim hierarchy iterator

2013-06-12T23:29:46+00:00

The lockless reclaim hierarchy iterator currently has a misplaced
barrier that can lead to use-after-free crashes.

The reclaim hierarchy iterator consists of a sequence count and a
position pointer that are read and written locklessly, with memory
barriers enforcing ordering.

The write side sets the position pointer first, then updates the
sequence count to "publish" the new position.  Likewise, the read side
must read the sequence count first, then the position.  If the sequence
count is up to date, it's guaranteed that the position is up to date as
well:

  writer:                         reader:
  iter->position = position       if iter->sequence == expected:
  smp_wmb()                           smp_rmb()
  iter->sequence = sequence           position = iter->position

However, the read side barrier is currently misplaced, which can lead to
dereferencing stale position pointers that no longer point to valid
memory.  Fix this.
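
The ordering in the diagram above can be modelled in userspace with C11
atomics.  This is a rough sketch under stated assumptions, not the memcg
code: struct sketch_iter and the sketch_*() helpers are hypothetical, and
the release/acquire fences only play the role of smp_wmb()/smp_rmb().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the iterator; field names follow the diagram. */
struct sketch_iter {
	_Atomic(void *) position;
	atomic_uint sequence;
};

static void sketch_init(struct sketch_iter *iter)
{
	atomic_init(&iter->position, NULL);
	atomic_init(&iter->sequence, 0u);
}

/* Writer: store the position first, then publish it via the sequence. */
static void sketch_publish(struct sketch_iter *iter, void *position,
			   unsigned int sequence)
{
	atomic_store_explicit(&iter->position, position, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* role of smp_wmb() */
	atomic_store_explicit(&iter->sequence, sequence, memory_order_relaxed);
}

/* Reader: check the sequence first, and only then read the position. */
static void *sketch_read(struct sketch_iter *iter, unsigned int expected)
{
	if (atomic_load_explicit(&iter->sequence, memory_order_relaxed) != expected)
		return NULL;				/* position may be stale */
	atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */
	return atomic_load_explicit(&iter->position, memory_order_relaxed);
}
```

The point of the sketch is the placement of the read-side fence: it sits
between the sequence check and the position load, so a reader that sees
the expected sequence also sees the position published with it.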

Signed-off-by: Johannes Weiner 
Reported-by: Tejun Heo 
Reviewed-by: Tejun Heo 
Acked-by: Michal Hocko 
Cc: KAMEZAWA Hiroyuki 
Cc: Glauber Costa 
Cc: 		[3.10+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

frontswap: fix incorrect zeroing and allocation size for frontswap_map

2013-06-12T23:29:46+00:00

The bitmap accessed by bitops must have enough size to hold the required
numbers of bits rounded up to a multiple of BITS_PER_LONG.  And the
bitmap must not be zeroed by memset() if the number of bits cleared is
not a multiple of BITS_PER_LONG.

This fixes incorrect zeroing and allocation size for frontswap_map.  The
incorrect zeroing part doesn't cause any problem because frontswap_map
is freed just after zeroing.  But the wrongly calculated allocation size
may cause problems.

For 32bit systems, the allocation size of frontswap_map is about twice
as large as the required size.  For 64bit systems, the allocation size is
smaller than required if the number of bits is not a multiple of
BITS_PER_LONG.
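
The size rule can be sketched as below.  The sketch_*() helpers are
hypothetical userspace names, and sketch_buggy_bytes() is only an assumed
form of the miscalculation (dividing the bit count by sizeof(long)) that
would match the symptoms described above; it is not quoted from the patch.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG	(sizeof(long) * CHAR_BIT)

/* Number of longs needed to hold nbits bits, rounded up. */
static size_t sketch_bits_to_longs(size_t nbits)
{
	return (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

/* Bytes to allocate (and zero) for a bitmap that bitops may access. */
static size_t sketch_bitmap_bytes(size_t nbits)
{
	return sketch_bits_to_longs(nbits) * sizeof(long);
}

/*
 * Assumed form of the bug: divides by bytes-per-long instead of
 * bits-per-byte, and truncates instead of rounding up.
 */
static size_t sketch_buggy_bytes(size_t nbits)
{
	return nbits / sizeof(long);
}
```

With a 4-byte long the assumed buggy form yields nbits/4 bytes where about
nbits/8 are needed; with an 8-byte long it yields nbits/8 bytes rounded
down, which falls short whenever nbits is not a multiple of BITS_PER_LONG.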

Signed-off-by: Akinobu Mita 
Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: migration: add migrate_entry_wait_huge()

2013-06-12T23:29:46+00:00

When we have a page fault for an address which is backed by a hugepage
under migration, the kernel can't wait correctly and busy-loops on the
hugepage fault until the migration finishes.  As a result, users who try
to kick hugepage migration (via soft offlining, for example) occasionally
experience long delays or soft lockups.

This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for a hugepage.  This patch introduces
migration_entry_wait_huge() to solve this.

Signed-off-by: Naoya Horiguchi 
Reviewed-by: Rik van Riel 
Reviewed-by: Wanpeng Li 
Reviewed-by: Michal Hocko 
Cc: Mel Gorman 
Cc: Andi Kleen 
Cc: KOSAKI Motohiro 
Cc: 	[2.6.35+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/page_alloc.c: fix watermark check in __zone_watermark_ok()

2013-06-12T23:29:46+00:00

The watermark check consists of two sub-checks.  The first one is:

	if (free_pages <= min + lowmem_reserve)
		return false;

The check ensures that there is a minimal amount of RAM in the zone.  If
CMA is used then free_pages is reduced by the number of free pages
in CMA prior to the above-mentioned check.

	if (!(alloc_flags & ALLOC_CMA))
		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);

This prevents the zone from being drained of pages available for
non-movable allocations.

The second check prevents the zone from getting too fragmented.

	for (o = 0; o < order; o++) {
		free_pages -= z->free_area[o].nr_free << o;
		min >>= 1;
		if (free_pages <= min)
			return false;
	}

The field z->free_area[o].nr_free is equal to the number of free pages
including free CMA pages.  Therefore the CMA pages are subtracted twice.
This may cause a spurious failure of __zone_watermark_ok() if the CMA
area gets strongly fragmented.  In such a case there are many 0-order
free pages located in CMA.  Those pages are subtracted twice, so
they quickly drain free_pages during the check against
fragmentation.  The test fails even though there are many free non-CMA
pages in the zone.

This patch fixes this issue by subtracting CMA pages only for the purpose
of the (free_pages <= min + lowmem_reserve) check.
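
A userspace sketch of the corrected check, built from the fragments
quoted above.  struct sketch_zone, SKETCH_ALLOC_CMA and
sketch_watermark_ok() are hypothetical simplifications, and free_pages is
passed in directly rather than read from zone counters.

```c
#include <stdbool.h>

#define SKETCH_MAX_ORDER	11
#define SKETCH_ALLOC_CMA	0x1

/* Hypothetical flattened view of the counters used by the check. */
struct sketch_zone {
	long nr_free[SKETCH_MAX_ORDER];	/* free blocks per order, CMA included */
	long free_cma_pages;		/* free pages that lie in CMA */
};

static bool sketch_watermark_ok(const struct sketch_zone *z, int order,
				long min, long lowmem_reserve,
				long free_pages, int alloc_flags)
{
	long free_cma = 0;
	int o;

	/* Requests without the CMA flag cannot use free CMA pages. */
	if (!(alloc_flags & SKETCH_ALLOC_CMA))
		free_cma = z->free_cma_pages;

	/* CMA pages are discounted for this comparison only. */
	if (free_pages - free_cma <= min + lowmem_reserve)
		return false;

	/* Fragmentation check: nr_free already includes CMA pages. */
	for (o = 0; o < order; o++) {
		free_pages -= z->nr_free[o] << o;
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}
```

Because free_pages itself is no longer reduced by the CMA count, the
order-0 CMA pages are subtracted once in the loop rather than twice.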

Laura said:

  We were observing allocation failures of higher order pages (order 5 =
  128K typically) under tight memory conditions resulting in driver
  failure.  The output from the page allocation failure showed plenty of
  free pages of the appropriate order/type/zone and mostly CMA pages in
  the lower orders.

  For full disclosure, we still observed some page allocation failures
  even after applying the patch but the number was drastically reduced and
  those failures were attributed to fragmentation/other system issues.

Signed-off-by: Tomasz Stanislawski 
Signed-off-by: Kyungmin Park 
Tested-by: Laura Abbott 
Cc: Bartlomiej Zolnierkiewicz 
Acked-by: Minchan Kim 
Cc: Mel Gorman 
Tested-by: Marek Szyprowski 
Cc: 	[3.7+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion

2013-06-12T23:29:45+00:00

read_swap_cache_async() can race against get_swap_page(), and stumble
across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
into the swapcache yet.

This swap_map state is expected to be transitory, but the actual
placement of discard in scan_swap_map() inserts a wait for I/O
completion, causing the thread in read_swap_cache_async() to loop
around its -EEXIST case while the other end in get_swap_page() is
scheduled away in scan_swap_map().  This can leave the system deadlocked
if the I/O completion happens to be waiting on the CPU waitqueue where
read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.

This patch introduces a cond_resched() call so that the aforementioned
read_swap_cache_async() busy loop gives up the CPU when necessary,
thus avoiding the subtle race window.
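
The effect of yielding inside the retry loop can be shown with a toy,
single-threaded model.  Everything here is hypothetical: sketch_yield()
merely stands in for cond_resched() by letting the "other task" finish one
unit of work, and max_spins exists only so the non-yielding case
terminates.

```c
#include <errno.h>
#include <stdbool.h>

/* The slot stays reserved until the other task has had enough CPU time. */
struct sketch_slot {
	int steps_left;		/* work the other task still has to do */
};

/* Stand-in for cond_resched(): the other task runs for one step. */
static void sketch_yield(struct sketch_slot *s)
{
	if (s->steps_left > 0)
		s->steps_left--;
}

/* Fails with -EEXIST while the slot is reserved but not yet populated. */
static int sketch_try_claim(const struct sketch_slot *s)
{
	return s->steps_left > 0 ? -EEXIST : 0;
}

/*
 * Retry until the claim succeeds.  Returns the number of retries, or -1
 * if max_spins retries passed without progress (a model of the lockup).
 */
static int sketch_read_loop(struct sketch_slot *s, bool yield, int max_spins)
{
	int spins = 0;

	while (sketch_try_claim(s) == -EEXIST) {
		if (spins++ >= max_spins)
			return -1;
		if (yield)
			sketch_yield(s);
	}
	return spins;
}
```

In this model the loop finishes after as many retries as the other task
needs steps when it yields, and never makes progress when it does not.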

Signed-off-by: Rafael Aquini 
Acked-by: Johannes Weiner 
Acked-by: KOSAKI Motohiro 
Acked-by: Hugh Dickins 
Cc: Shaohua Li 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: don't initialize kmem-cache destroying work for root caches

2013-06-12T23:29:45+00:00

struct memcg_cache_params has a union.  Different parts of this union
are used for root and non-root caches.  A part with destroying work is
used only for non-root caches.

  BUG: unable to handle kernel paging request at 0000000fffffffe0
  IP: kmem_cache_alloc+0x41/0x1f0
  Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
  CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G      D      3.10.0-rc1+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  RIP: kmem_cache_alloc+0x41/0x1f0
  Call Trace:
   getname_flags.part.34+0x30/0x140
   getname+0x38/0x60
   do_sys_open+0xc5/0x1e0
   SyS_open+0x22/0x30
   system_call_fastpath+0x16/0x1b
  Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 <4d> 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
  RIP  [] kmem_cache_alloc+0x41/0x1f0
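
Why initializing the work item for a root cache is harmful can be seen in
a simplified userspace model of such a union.  The layout and all
sketch_* names below are hypothetical and much smaller than the real
struct memcg_cache_params; only the overlap between the two union parts
matters.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for a work item. */
struct sketch_work {
	void (*fn)(void *);
	void *data;
};

struct sketch_cache_params {
	bool is_root_cache;
	union {
		void *child_caches[4];			/* root caches only */
		struct {
			void *owner;
			struct sketch_work destroy;	/* non-root caches only */
		} child;
	} u;
};

static void sketch_destroy_fn(void *data)
{
	(void)data;
}

static void sketch_init_params(struct sketch_cache_params *p, bool is_root)
{
	memset(p, 0, sizeof(*p));
	p->is_root_cache = is_root;

	/*
	 * Only non-root caches own the destroy work.  Writing it for a
	 * root cache would overwrite entries of child_caches[], which
	 * shares the same storage.
	 */
	if (!is_root) {
		p->u.child.destroy.fn = sketch_destroy_fn;
		p->u.child.destroy.data = p;
	}
}
```

With the initialization guarded this way, a root cache keeps its array of
child pointers zeroed instead of holding stray work-item fields.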

Signed-off-by: Andrey Vagin 
Cc: Konstantin Khlebnikov 
Cc: Glauber Costa 
Cc: Johannes Weiner 
Cc: Balbir Singh 
Cc: KAMEZAWA Hiroyuki 
Reviewed-by: Michal Hocko 
Cc: Li Zefan 
Cc: 	[3.9.x]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

arch, mm: Remove tlb_fast_mode()

2013-06-06T01:07:26+00:00

Since the introduction of preemptible mmu_gather, TLB fast mode has been
broken. TLB fast mode relies on there being absolutely no concurrency;
it frees pages first and invalidates TLBs later.

However now we can get concurrency and stuff goes *bang*.

This patch removes all tlb_fast_mode() code; this was found to be a
better option than trying to patch the hole by entangling TLB
invalidation with the scheduler.

Cc: Thomas Gleixner 
Cc: Russell King 
Cc: Tony Luck 
Reported-by: Max Filippov 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Linus Torvalds

mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas

2013-05-24T23:22:53+00:00

A panic can be caused by simply cat'ing /proc/<pid>/smaps while an
application has a VM_PFNMAP range.  It happened in-house when a
benchmarker was trying to decipher the memory layout of his program.

/proc/<pid>/smaps and similar walks through a user page table should not
be looking at VM_PFNMAP areas.

Certain tests in walk_page_range() (specifically split_huge_page_pmd())
assume that all the mapped PFNs are backed by page structures.  And
this is not usually true for VM_PFNMAP areas.  This can result in panics
on kernel page faults when attempting to address those page structures.

There are half a dozen callers of walk_page_range() that walk through a
task's entire page table (as N. Horiguchi pointed out).  So rather than
change all of them, this patch changes just walk_page_range() to ignore
VM_PFNMAP areas.

The logic of hugetlb_vma() is moved back into walk_page_range(), as we
want to test any vma in the range.
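
The approach of filtering in the walker rather than in every caller can
be sketched in userspace.  struct sketch_vma, SKETCH_VM_PFNMAP and
sketch_walk_range() are hypothetical, the vma list is assumed to be
sorted by address, and the "walk" only counts bytes where a real walker
would descend the page tables.

```c
#include <stddef.h>

#define SKETCH_VM_PFNMAP	0x1u

/* Hypothetical, minimal vma: [start, end) plus flags, in a sorted list. */
struct sketch_vma {
	unsigned long start, end;
	unsigned long flags;
	struct sketch_vma *next;
};

/* Walk [start, end) and return how many bytes were actually visited. */
static unsigned long sketch_walk_range(const struct sketch_vma *vma,
				       unsigned long start, unsigned long end)
{
	unsigned long walked = 0;

	for (; vma && vma->start < end; vma = vma->next) {
		unsigned long lo = vma->start > start ? vma->start : start;
		unsigned long hi = vma->end < end ? vma->end : end;

		if (lo >= hi)
			continue;	/* no overlap with the range */
		if (vma->flags & SKETCH_VM_PFNMAP)
			continue;	/* PFNs may lack page structures */
		walked += hi - lo;	/* page tables would be walked here */
	}
	return walked;
}
```

Every caller that walks a whole address space gets the skip for free,
because the test is applied to each vma in the requested range.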

VM_PFNMAP areas are used by:
- graphics memory manager   gpu/drm/drm_gem.c
- global reference unit     sgi-gru/grufile.c
- sgi special memory        char/mspec.c
- and probably several out-of-tree modules

[akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
Signed-off-by: Cliff Wickman 
Reviewed-by: Naoya Horiguchi 
Cc: Mel Gorman 
Cc: Andrea Arcangeli 
Cc: Dave Hansen 
Cc: David Sterba 
Cc: Johannes Weiner 
Cc: KOSAKI Motohiro 
Cc: "Kirill A. Shutemov" 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds