linux-stable.git/mm, branch v3.14.2

bdi: avoid oops on device removal

2014-04-27T00:19:05+00:00

commit 5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555 upstream.

After commit 839a8e8660b6 ("writeback: replace custom worker pool
implementation with unbound workqueue") when device is removed while we
are writing to it we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.

This can happen because even though bdi_unregister() cancels all pending
flushing work, nothing really prevents new ones from being queued from
balance_dirty_pages() or other places.

Fix the problem by clearing BDI_registered bit in bdi_unregister() and
checking it before scheduling of any flushing work.

Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977

Reviewed-by: Tejun Heo 
Signed-off-by: Jan Kara 
Cc: Derek Basehore 
Cc: Jens Axboe 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

backing_dev: fix hung task on sync

2014-04-27T00:19:05+00:00

commit 6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e upstream.

bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
schedule work to writeback dirty inodes.  The problem with this is that
it can delay work that is scheduled for immediate execution, such as the
work from sync_inodes_sb().  This can happen since mod_delayed_work()
can now steal work from a work_queue.  This fixes the problem by using
queue_delayed_work() instead.  This is a regression caused by commit
839a8e8660b6 ("writeback: replace custom worker pool implementation with
unbound workqueue").

The reason that this causes a problem is that laptop-mode will change
the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
In the case that bdi_wakeup_thread_delayed() races with
sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
task.  Even if dirty_writeback_centisecs is not long enough to cause a
hung task, we still don't want to delay sync for that long.

We fix the problem by using queue_delayed_work() when we want to
schedule writeback sometime in future.  This function doesn't change the
timer if it is already armed.

For the same reason, we also change bdi_writeback_workfn() to
immediately queue the work again in the case that the work_list is not
empty.  The same problem can happen if the sync work is run on the
rescue worker.

[jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
Signed-off-by: Derek Basehore 
Reviewed-by: Jan Kara 
Cc: Alexander Viro 
Reviewed-by: Tejun Heo 
Cc: Greg Kroah-Hartman 
Cc: "Darrick J. Wong" 
Cc: Derek Basehore 
Cc: Kees Cook 
Cc: Benson Leung 
Cc: Sonny Rao 
Cc: Luigi Semenzato 
Cc: Jens Axboe 
Cc: Dave Chinner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: fix swapops.h:131 bug if remap_file_pages raced migration

2014-03-21T05:09:09+00:00

Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
indicating that remove_migration_ptes() failed to find one of the
migration entries that was temporarily inserted.

The problem comes from remap_file_pages()'s switch from vma_interval_tree
(good for inserting the migration entry) to i_mmap_nonlinear list (no good
for locating it again); but can only be a problem if the remap_file_pages()
range does not cover the whole of the vma (zap_pte() clears the range).

remove_migration_ptes() needs a file_nonlinear method to go down the
i_mmap_nonlinear list, applying linear location to look for migration
entries in those vmas too, just in case there was this race.

The file_nonlinear method does need rmap_walk_control.arg to do this;
but it never needed vma passed in - vma comes from its own iteration.

Reported-and-tested-by: Dave Jones 
Reported-and-tested-by: Sasha Levin 
Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds

mm: fix bad rss-counter if remap_file_pages raced migration

2014-03-19T23:21:49+00:00

Fix some "Bad rss-counter state" reports on exit, arising from the
interaction between page migration and remap_file_pages(): zap_pte()
must count a migration entry when zapping it.

And yes, it is possible (though very unusual) to find an anon page or
swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
get_user_pages(write, force) case which COWs even in a shared mapping.

Signed-off-by: Hugh Dickins 
Tested-by: Sasha Levin sasha.levin@oracle.com>
Tested-by: Dave Jones davej@redhat.com>
Cc: Cyrill Gorcunov 
Cc: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/Kconfig: fix URL for zsmalloc benchmark

2014-03-11T00:26:20+00:00

The help text for CONFIG_PGTABLE_MAPPING has an incorrect URL.  While
we're at it, remove the unnecessary footnote notation.

Signed-off-by: Ben Hutchings 
Acked-by: Minchan Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/compaction: break out of loop on !PageBuddy in isolate_freepages_block

2014-03-11T00:26:20+00:00

We received several reports of bad page state when freeing CMA pages
previously allocated with alloc_contig_range:

    BUG: Bad page state in process Binder_A  pfn:63202
    page:d21130b0 count:0 mapcount:1 mapping:  (null) index:0x7dfbf
    page flags: 0x40080068(uptodate|lru|active|swapbacked)

Based on the page state, it looks like the page was still in use.  The
page flags do not make sense for the use case though.  Further debugging
showed that despite alloc_contig_range returning success, at least one
page in the range still remained in the buddy allocator.

There is an issue with isolate_freepages_block.  In strict mode (which
CMA uses), if any pages in the range cannot be isolated,
isolate_freepages_block should return failure 0.  The current check
keeps track of the total number of isolated pages and compares against
the size of the range:

        if (strict && nr_strict_required > total_isolated)
                total_isolated = 0;

After taking the zone lock, if one of the pages in the range is not in
the buddy allocator, we continue through the loop and do not increment
total_isolated.  If in the last iteration of the loop we isolate more
than one page (e.g.  last page needed is a higher order page), the check
for total_isolated may pass and we fail to detect that a page was
skipped.  The fix is to bail out if the loop immediately if we are in
strict mode.  There's no benfit to continuing anyway since we need all
pages to be isolated.  Additionally, drop the error checking based on
nr_strict_required and just check the pfn ranges.  This matches with
what isolate_freepages_range does.

Signed-off-by: Laura Abbott 
Acked-by: Minchan Kim 
Cc: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Joonsoo Kim 
Acked-by: Bartlomiej Zolnierkiewicz 
Acked-by: Michal Nazarewicz 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix GFP_THISNODE callers and clarify

2014-03-11T00:26:19+00:00

GFP_THISNODE is for callers that implement their own clever fallback to
remote nodes.  It restricts the allocation to the specified node and
does not invoke reclaim, assuming that the caller will take care of it
when the fallback fails, e.g.  through a subsequent allocation request
without GFP_THISNODE set.

However, many current GFP_THISNODE users only want the node exclusive
aspect of the flag, without actually implementing their own fallback or
triggering reclaim if necessary.  This results in things like page
migration failing prematurely even when there is easily reclaimable
memory available, unless kswapd happens to be running already or a
concurrent allocation attempt triggers the necessary reclaim.

Convert all callsites that don't implement their own fallback strategy
to __GFP_THISNODE.  This restricts the allocation a single node too, but
at the same time allows the allocator to enter the slowpath, wake
kswapd, and invoke direct reclaim if necessary, to make the allocation
happen when memory is full.

Signed-off-by: Johannes Weiner 
Acked-by: Rik van Riel 
Cc: Jan Stancek 
Cc: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness

2014-03-04T15:55:50+00:00

Jan Stancek reports manual page migration encountering allocation
failures after some pages when there is still plenty of memory free, and
bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair
zone allocator policy").

The problem is that GFP_THISNODE obeys the zone fairness allocation
batches on one hand, but doesn't reset them and wake kswapd on the other
hand.  After a few of those allocations, the batches are exhausted and
the allocations fail.

Fixing this means either having GFP_THISNODE wake up kswapd, or
GFP_THISNODE not participating in zone fairness at all.  The latter
seems safer as an acute bugfix, we can clean up later.

Reported-by: Jan Stancek 
Signed-off-by: Johannes Weiner 
Acked-by: Rik van Riel 
Acked-by: Mel Gorman 
Cc: 		[3.12+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking

2014-03-04T15:55:48+00:00

Daniel Borkmann reported a VM_BUG_ON assertion failing:

  ------------[ cut here ]------------
  kernel BUG at mm/mlock.c:528!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ccm arc4 iwldvm [...]
   video
  CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
  Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
  task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
  RIP: 0010:[]  [] munlock_vma_pages_range+0x2e0/0x2f0
  Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
  RIP   munlock_vma_pages_range+0x2e0/0x2f0
  ---[ end trace a0088dcf07ae10f2 ]---

because munlock_vma_pages_range() thinks it's unexpectedly in the middle
of a THP page.  This can be reproduced with default config since 3.11
kernels.  A reproducer can be found in the kernel's selftest directory
for networking by running ./psock_tpacket.

The problem is that an order=2 compound page (allocated by
alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped
by packet_mmap()) and mistaken for a THP page and assumed to be order=9.

The checks for THP in munlock came with commit ff6a6da60b89 ("mm:
accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
not trigger a bug.  It just makes munlock_vma_pages_range() skip such
compound pages until the next 512-pages-aligned page, when it encounters
a head page.  This is however not a problem for vma's where mlocking has
no effect anyway, but it can distort the accounting.

Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation
and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
PageTransHuge() check.

This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
list of flags that make vma's non-mlockable and non-mergeable.  The
reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
already on the VM_SPECIAL list, and both are intended for non-LRU pages
where mlocking makes no sense anyway.  Related Lkml discussion can be
found in [2].

 [1] tools/testing/selftests/net/psock_tpacket
 [2] https://lkml.org/lkml/2014/1/10/427

Signed-off-by: Vlastimil Babka 
Signed-off-by: Daniel Borkmann 
Reported-by: Daniel Borkmann 
Tested-by: Daniel Borkmann 
Cc: Thomas Hellstrom 
Cc: John David Anglin 
Cc: HATAYAMA Daisuke 
Cc: Konstantin Khlebnikov 
Cc: Carsten Otte 
Cc: Jared Hulbert 
Tested-by: Hannes Frederic Sowa 
Cc: Kirill A. Shutemov 
Acked-by: Rik van Riel 
Cc: Andrea Arcangeli 
Cc:  [3.11.x+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: reparent charges of children before processing parent

2014-03-04T15:55:48+00:00

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding
cgroup_mutex which prevents the child from reaching its
mem_cgroup_reparent_charges().

Further testing showed that an ordered workqueue for cgroup_destroy_wq
is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
stage on the way can mess up the order before reaching the workqueue.

Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
all its children (and grandchildren, in the correct order) to have their
charges reparented first.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Signed-off-by: Filipe Brandenburger 
Signed-off-by: Hugh Dickins 
Reviewed-by: Tejun Heo 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: 	[v3.10+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds