linux.git/mm, branch v3.0-rc5

memcg: fix direct softlimit reclaim to be called in limit path

2011-06-28T01:00:13+00:00

Commit d149e3b25d7c ("memcg: add the soft_limit reclaim in global direct
reclaim") adds a softlimit hook to shrink_zones().  As a result, soft limit
reclaim is called as

   try_to_free_pages()
       do_try_to_free_pages()
           shrink_zones()
               mem_cgroup_soft_limit_reclaim()

With this, direct reclaim is now aware of the memcg softlimit hint.

However, the memory cgroup's "limit" path can also call the softlimit shrinker:

   try_to_free_mem_cgroup_pages()
       do_try_to_free_pages()
           shrink_zones()
               mem_cgroup_soft_limit_reclaim()

This causes a global reclaim whenever a memcg hits its limit.

This is a bug: mem_cgroup_soft_limit_reclaim() should be called only when
scanning_global_lru(sc) == true.

The commit also adds a variable "total_scanned" for counting pages scanned
by softlimit reclaim, but it is not a "total".  This patch removes the
variable and updates sc->nr_scanned instead.  This will affect
shrink_slab()'s scan condition, but since the global LRU is scanned by
softlimit reclaim, I think this change makes sense.
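
The guard and the accounting change can be illustrated with a minimal
userspace sketch.  It is not the kernel code: struct scan_control is
reduced to two fields, and soft_limit_reclaim_stub() and
shrink_zones_sketch() are hypothetical stand-ins whose return values are
made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct scan_control (assumption). */
struct scan_control {
	unsigned long nr_scanned;
	void *mem_cgroup;	/* NULL means global reclaim */
};

static bool scanning_global_lru(const struct scan_control *sc)
{
	return sc->mem_cgroup == NULL;
}

/* Hypothetical stub: pretends soft limit reclaim scanned 32 pages. */
static unsigned long soft_limit_reclaim_stub(unsigned long *nr_scanned)
{
	*nr_scanned = 32;
	return 8;	/* pages "reclaimed" */
}

static unsigned long shrink_zones_sketch(struct scan_control *sc)
{
	unsigned long nr_soft_scanned = 0;
	unsigned long nr_soft_reclaimed = 0;

	assert(sc != NULL);

	/* Only the global path may trigger soft limit reclaim. */
	if (scanning_global_lru(sc)) {
		nr_soft_reclaimed = soft_limit_reclaim_stub(&nr_soft_scanned);
		/* Account into sc->nr_scanned, not a separate counter. */
		sc->nr_scanned += nr_soft_scanned;
	}
	return nr_soft_reclaimed;
}
```

With mem_cgroup set (the "limit" path), the sketch skips the soft limit
hook entirely and leaves sc->nr_scanned untouched.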

TODO: avoid too much scanning of a zone when softlimit did enough work.

Signed-off-by: KAMEZAWA Hiroyuki 
Cc: Daisuke Nishimura 
Cc: Ying Han 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix assertion mapping->nrpages == 0 in end_writeback()

2011-06-28T01:00:13+00:00

Under heavy memory and filesystem load, users observe the assertion
mapping->nrpages == 0 in end_writeback() trigger.  This can be caused by
page reclaim reclaiming the last page from a mapping in the following
race:

	CPU0				CPU1
  ...
  shrink_page_list()
    __remove_mapping()
      __delete_from_page_cache()
        radix_tree_delete()
					evict_inode()
					  truncate_inode_pages()
					    truncate_inode_pages_range()
					      pagevec_lookup() - finds nothing
					  end_writeback()
					    mapping->nrpages != 0 -> BUG
        page->mapping = NULL
        mapping->nrpages--

Fix the problem by doing a reliable check of mapping->nrpages under
mapping->tree_lock in end_writeback().
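
A minimal userspace sketch of why the locked check is reliable, assuming
a C11 atomic_flag as a stand-in for the spinlock.  struct mapping_sketch
and the helper names are hypothetical; the point is only that the remover
updates nrpages inside the same critical section in which it deletes the
page, so a reader that takes the lock cannot observe the intermediate
state.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct address_space. */
struct mapping_sketch {
	atomic_flag tree_lock;
	unsigned long nrpages;
};

static void mapping_init(struct mapping_sketch *m, unsigned long nrpages)
{
	atomic_flag_clear(&m->tree_lock);
	m->nrpages = nrpages;
}

static void tree_lock(struct mapping_sketch *m)
{
	while (atomic_flag_test_and_set_explicit(&m->tree_lock,
						 memory_order_acquire))
		;	/* spin */
}

static void tree_unlock(struct mapping_sketch *m)
{
	atomic_flag_clear_explicit(&m->tree_lock, memory_order_release);
}

/* Reclaim side: deletion and nrpages-- happen under one lock hold. */
static void remove_page_sketch(struct mapping_sketch *m)
{
	tree_lock(m);
	assert(m->nrpages > 0);
	/* radix tree deletion would happen here */
	m->nrpages--;
	tree_unlock(m);
}

/* Eviction side: read nrpages only while holding the lock. */
static unsigned long nrpages_locked(struct mapping_sketch *m)
{
	unsigned long n;

	tree_lock(m);
	n = m->nrpages;
	tree_unlock(m);
	return n;
}
```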

Analyzed by Jay, lost in LKML, and dug out
by Miklos Szeredi.

Cc: Jay 
Cc: Miklos Szeredi 
Signed-off-by: Jan Kara 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/memory-failure.c: fix spinlock vs mutex order

2011-06-28T01:00:13+00:00

We cannot take a mutex while holding a spinlock, so flip the order and
fix the locking documentation.

Signed-off-by: Peter Zijlstra 
Acked-by: Andi Kleen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

tmpfs: add shmem_read_mapping_page_gfp

2011-06-28T01:00:12+00:00

Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
is unsuited to tmpfs, because it inserts a page into pagecache before
calling the filesystem's ->readpage: tmpfs may have pages in swapcache
which only it knows how to locate and switch to filecache.

At present tmpfs provides a ->readpage method, and copes with this by
copying pages; but soon we can simplify it by removing its ->readpage.
Provide shmem_read_mapping_page_gfp() now, ready for that transition.

Export shmem_read_mapping_page_gfp() and add it to the list in shmem_fs.h,
with shmem_read_mapping_page() inline for the common mapping_gfp case.
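
The relationship between the two entry points can be sketched in
userspace as below.  The types and the _sketch names are hypothetical
stand-ins, not the kernel definitions; the stub only records which gfp
mask it was given, so that the wrapper's behaviour is observable.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int gfp_t;

/* Hypothetical stand-ins for struct address_space and struct page. */
struct address_space_sketch {
	gfp_t gfp_mask;
};

struct page_sketch {
	unsigned long index;
	gfp_t gfp_used;
};

static gfp_t mapping_gfp_mask_sketch(const struct address_space_sketch *m)
{
	return m->gfp_mask;
}

/* Stub for the _gfp variant: records the index and gfp mask passed in. */
static struct page_sketch *
shmem_read_mapping_page_gfp_sketch(struct address_space_sketch *mapping,
				   unsigned long index, gfp_t gfp)
{
	static struct page_sketch page;

	assert(mapping != NULL);
	page.index = index;
	page.gfp_used = gfp;
	return &page;
}

/* Inline wrapper for the common case: use the mapping's own gfp mask. */
static inline struct page_sketch *
shmem_read_mapping_page_sketch(struct address_space_sketch *mapping,
			       unsigned long index)
{
	return shmem_read_mapping_page_gfp_sketch(mapping, index,
				mapping_gfp_mask_sketch(mapping));
}
```

A caller that needs a different allocation mask calls the _gfp variant
directly; everyone else uses the wrapper.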

(shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
read_mapping_page functions use the mapping's ->readpage, and the
read_cache_page functions use the supplied filler, so I think
read_cache_page_gfp was slightly misnamed.)

Signed-off-by: Hugh Dickins 
Cc: Christoph Hellwig 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

tmpfs: take control of its truncate_range

2011-06-28T01:00:12+00:00

2.6.35's new truncate convention gave tmpfs the opportunity to control
its file truncation, no longer enforced from outside by vmtruncate().
We shall want to build upon that, to handle pagecache and swap together.

Slightly redefine the ->truncate_range interface: let it now be called
between the unmap_mapping_range()s, with the filesystem responsible for
doing the truncate_inode_pages_range() from it - just as the filesystem
is nowadays responsible for doing that from its ->setattr.

Let's rename shmem_notify_change() to shmem_setattr().  Instead of
calling the generic truncate_setsize(), bring that code in so we can
call shmem_truncate_range() - which will later be updated to perform its
own variant of truncate_inode_pages_range().

Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
now that the COW's unmap_mapping_range() comes after ->truncate_range,
there is no need to call it a third time.

Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
that i915_gem_object_truncate() can call it explicitly in future; get
this patch in first, then update drm/i915 once this is available (until
then, i915 will just be doing the truncate_inode_pages() twice).

Though ->truncate_range was introduced five years ago, no other filesystem
implements it, and its only other user is madvise(,,MADV_REMOVE): we
expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
whereupon ->truncate_range can be removed from inode_operations -
shmem_truncate_range() will help i915 across that transition too.

Signed-off-by: Hugh Dickins 
Cc: Christoph Hellwig 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: move shmem prototypes to shmem_fs.h

2011-06-28T01:00:12+00:00

Before adding any more global entry points into shmem.c, gather such
prototypes into shmem_fs.h.  Remove mm's own declarations from swap.h,
but for now leave the ones in mm.h: because shmem_file_setup() and
shmem_zero_setup() are called from various places, and we should not
force other subsystems to update immediately.

Signed-off-by: Hugh Dickins 
Cc: Christoph Hellwig 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: move vmtruncate_range to truncate.c

2011-06-28T01:00:12+00:00

You would expect to find vmtruncate_range() next to vmtruncate() in
mm/truncate.c: move it there.

Signed-off-by: Hugh Dickins 
Acked-by: Christoph Hellwig 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, hotplug: protect zonelist building with zonelists_mutex

2011-06-23T04:06:48+00:00

Commit 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
zonelist") does not protect the build_all_zonelists() call with
zonelists_mutex as needed.  This can lead to races in constructing
zonelist ordering if a concurrent build is underway.  Protecting this
with lock_memory_hotplug() is insufficient since zonelists can be
rebuilt through sysfs as well.

Signed-off-by: David Rientjes 
Reviewed-by: KOSAKI Motohiro 
Signed-off-by: Linus Torvalds

mm, hotplug: fix error handling in mem_online_node()

2011-06-23T04:06:47+00:00

The error handling in mem_online_node() is incorrect: hotadd_new_pgdat()
returns NULL if the new pgdat could not be allocated and a pointer
to it otherwise.

mem_online_node() should fail if hotadd_new_pgdat() fails, not the
inverse.  This fixes an issue where memoryless nodes are not onlined and
their sysfs interface is not registered when their first cpu is brought
up.
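
The corrected check can be sketched in userspace as follows.  The stub,
the struct, and the simulate_alloc_failure parameter are hypothetical and
exist only to exercise both branches; the point is that a NULL return
must map to failure, not success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for pg_data_t. */
struct pgdat_sketch {
	int nid;
};

/* Stub: returns NULL on (simulated) allocation failure, else a pointer. */
static struct pgdat_sketch *hotadd_new_pgdat_stub(int nid, int fail)
{
	static struct pgdat_sketch pgdat;

	if (fail)
		return NULL;
	pgdat.nid = nid;
	return &pgdat;
}

static int mem_online_node_sketch(int nid, int simulate_alloc_failure)
{
	struct pgdat_sketch *pgdat;

	assert(nid >= 0);

	pgdat = hotadd_new_pgdat_stub(nid, simulate_alloc_failure);
	if (!pgdat)		/* the buggy version tested "if (pgdat)" */
		return -ENOMEM;

	/* Onlining the node and registering sysfs would follow here. */
	return 0;
}
```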

The bug was introduced by commit cf23422b9d76 ("cpu/mem hotplug: enable
CPUs online before local memory online"), i.e. in v2.6.35.

Signed-off-by: David Rientjes 
Reviewed-by: KOSAKI Motohiro 
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

mm: avoid anon_vma_chain allocation under anon_vma lock

2011-06-18T02:24:11+00:00

Hugh Dickins points out that lockdep (correctly) spots a potential
deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
of anon_vma_chain while doing anon_vma_clone().  The problem is that
page reclaim will want to take the anon_vma lock of any anonymous pages
that it will try to reclaim.

So re-organize the code in anon_vma_clone() slightly: first do just a
GFP_NOWAIT allocation, which will usually work fine.  But if that fails,
let's just drop the lock and re-do the allocation, now with GFP_KERNEL.
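
The allocation pattern can be sketched in userspace as below.  This is
not the anon_vma code: struct lock_sketch, the two allocator stubs, and
the fast_path_ok parameter are hypothetical, with malloc() standing in
for both GFP_NOWAIT and GFP_KERNEL allocations.  The assert in the
blocking stub encodes the rule the patch enforces: never block in the
allocator while the lock is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the anon_vma lock; counts how often we drop it. */
struct lock_sketch {
	bool held;
	int drops;
};

/* Stand-in for a GFP_NOWAIT allocation: may fail, never blocks. */
static void *alloc_nowait_stub(bool succeed)
{
	return succeed ? malloc(16) : NULL;
}

/* Stand-in for a GFP_KERNEL allocation: must not run with the lock held. */
static void *alloc_blocking_stub(const struct lock_sketch *l)
{
	assert(!l->held);
	return malloc(16);
}

static void *clone_one_chain_sketch(struct lock_sketch *l, bool fast_path_ok)
{
	void *avc;

	l->held = true;				/* take the lock */
	avc = alloc_nowait_stub(fast_path_ok);	/* try without blocking */
	if (!avc) {
		l->held = false;		/* drop the lock ... */
		l->drops++;
		avc = alloc_blocking_stub(l);	/* ... then allow blocking */
		if (!avc)
			return NULL;
		l->held = true;			/* retake before linking */
	}
	/* Linking the new chain entry would happen here, under the lock. */
	l->held = false;
	return avc;
}
```

In the common case the lock is never dropped; the slow path is taken only
when the non-blocking attempt fails.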

End result: not only do we avoid the locking problem, this also ends up
getting better concurrency in case the allocation does need to block.
Tim Chen reports that with all these anon_vma locking tweaks, we're now
almost back up to the spinlock performance.

Reported-and-tested-by: Hugh Dickins 
Tested-by: Tim Chen 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Signed-off-by: Linus Torvalds