linux.git/mm/Makefile, branch v3.15

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2014-04-12T21:49:50+00:00

Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencies between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe->buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...

mm: create generic early_ioremap() support

2014-04-07T23:36:15+00:00

This patch creates a generic implementation of early_ioremap() support
based on the existing x86 implementation.  early_ioremap() is useful for
early boot code which needs to temporarily map I/O or memory regions
before normal mapping functions such as ioremap() are available.

Some architectures have optional MMU.  In the no-MMU case, the remap
functions simply return the passed in physical address and the unmap
functions do nothing.
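
A minimal sketch of the no-MMU behaviour described above.  The names
sketch_early_ioremap() and sketch_early_iounmap() and the address type
are hypothetical stand-ins for illustration, not the definitions added
by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a physical address type. */
typedef uintptr_t sketch_phys_addr_t;

/*
 * No-MMU case: physical and virtual addresses coincide, so "remapping"
 * simply hands the physical address back as a pointer.
 */
static void *sketch_early_ioremap(sketch_phys_addr_t phys, size_t size)
{
	(void)size;		/* nothing is mapped, so the size is unused */
	return (void *)phys;
}

/* No-MMU case: there is no mapping to tear down. */
static void sketch_early_iounmap(void *addr, size_t size)
{
	(void)addr;
	(void)size;
}
```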

Signed-off-by: Mark Salter 
Acked-by: Catalin Marinas 
Acked-by: H. Peter Anvin 
Cc: Borislav Petkov 
Cc: Dave Young 
Cc: Will Deacon 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: per-thread vma caching

2014-04-07T23:35:53+00:00

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparisons with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate into finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme improves on the ~50% hit rate just by adding a few more slots
   to the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The number of cycles can fluctuate anywhere
   from ~60 to ~116 billion for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+
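
The per-thread cache described above can be modelled in userspace
roughly as follows.  This is only a sketch: the sk_* names, the slot
count and the page shift are assumptions made for illustration, and the
rare sequence number overflow (which would flush every cache sharing
the address space) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12
#define SK_CACHE_SLOTS	4	/* assumed slot count for this sketch */

struct sk_vma { uintptr_t start, end; };	/* covers [start, end) */

struct sk_mm { uint32_t seqnum; };		/* shared by all threads */

struct sk_thread {
	struct sk_mm *mm;
	uint32_t seqnum;			/* last mm seqnum seen */
	struct sk_vma *slots[SK_CACHE_SLOTS];
};

/* Invalidate every thread's cache by bumping the shared counter. */
static void sk_invalidate(struct sk_mm *mm)
{
	mm->seqnum++;
}

/* The slot is chosen from the page number that contains addr. */
static size_t sk_slot(uintptr_t addr)
{
	return (addr >> SK_PAGE_SHIFT) % SK_CACHE_SLOTS;
}

/* Drop stale entries if the address space changed since the last look. */
static void sk_validate(struct sk_thread *t)
{
	if (t->seqnum != t->mm->seqnum) {
		for (size_t i = 0; i < SK_CACHE_SLOTS; i++)
			t->slots[i] = NULL;
		t->seqnum = t->mm->seqnum;
	}
}

/* Returns the cached vma containing addr, or NULL on a miss. */
static struct sk_vma *sk_cache_find(struct sk_thread *t, uintptr_t addr)
{
	sk_validate(t);
	for (size_t i = 0; i < SK_CACHE_SLOTS; i++) {
		struct sk_vma *v = t->slots[i];

		if (v && v->start <= addr && addr < v->end)
			return v;
	}
	return NULL;	/* the caller falls back to the full lookup */
}

/* After a miss and a full lookup, remember the result. */
static void sk_cache_update(struct sk_thread *t, uintptr_t addr,
			    struct sk_vma *v)
{
	sk_validate(t);
	t->slots[sk_slot(addr)] = v;
}
```

A caller would try sk_cache_find() first and, on NULL, do the expensive
lookup followed by sk_cache_update().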

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso 
Reviewed-by: Rik van Riel 
Acked-by: Linus Torvalds 
Reviewed-by: Michel Lespinasse 
Cc: Oleg Nesterov 
Tested-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: thrash detection-based file cache sizing

2014-04-03T23:21:01+00:00

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
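to benefit from caching in the past.  We call the recently used list
"inactive list" and the frequently used list "active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages.  This means
that working set changes bigger than half of cache memory go undetected
and thrash indefinitely, whereas working sets bigger than half of cache
memory are unprotected against used-once streams that don't even need
caching.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved them
to the head of the inactive list.  This model gave established working
sets more gracetime in the face of temporary use-once streams, but
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.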

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.
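
The eviction snapshot and refault check can be sketched as below.  The
sk_* names are hypothetical, and the activation rule (compare the
distance against the current active list size) is an assumption of this
sketch rather than something stated in the text above.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_zone {
	unsigned long inactive_age;	/* advances as the inactive list ages */
	unsigned long nr_active;	/* pages currently on the active list */
};

/*
 * On eviction: advance the counter and return the snapshot that would
 * be left behind in the page's now-empty cache slot.
 */
static unsigned long sk_evict(struct sk_zone *z)
{
	z->inactive_age++;
	return z->inactive_age;
}

/*
 * On refault: how far the inactive list aged while the page was gone.
 * Unsigned subtraction stays correct across counter wraparound.
 */
static unsigned long sk_refault_distance(const struct sk_zone *z,
					 unsigned long snapshot)
{
	return z->inactive_age - snapshot;
}

/*
 * Assumed policy: activate the page if it would have stayed resident
 * had the active list's memory been available to the inactive list.
 */
static bool sk_should_activate(const struct sk_zone *z,
			       unsigned long snapshot)
{
	return sk_refault_distance(z, snapshot) <= z->nr_active;
}
```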

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner 
Reviewed-by: Rik van Riel 
Reviewed-by: Minchan Kim 
Reviewed-by: Bob Liu 
Cc: Andrea Arcangeli 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Greg Thelen 
Cc: Hugh Dickins 
Cc: Jan Kara 
Cc: KOSAKI Motohiro 
Cc: Luigi Semenzato 
Cc: Mel Gorman 
Cc: Metin Doslu 
Cc: Michel Lespinasse 
Cc: Ozgun Erdogan 
Cc: Peter Zijlstra 
Cc: Roman Gushchin 
Cc: Ryan Mallon 
Cc: Tejun Heo 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

take iov_iter stuff to mm/iov_iter.c

2014-04-02T03:19:30+00:00

Signed-off-by: Al Viro

zsmalloc: move it under mm

2014-01-31T00:56:55+00:00

This patch moves zsmalloc under mm directory.

Before that, here is a description of why we needed a custom
allocator.

zsmalloc is a new slab-based memory allocator for storing compressed
pages.  It is designed for low fragmentation and a high allocation
success rate on large objects that are still <= PAGE_SIZE.

zsmalloc differs from the kernel slab allocator in two primary ways to
achieve these design goals.

zsmalloc never requires high order page allocations to back slabs, or
"size classes" in zsmalloc terms.  Instead it allows multiple
single-order pages to be stitched together into a "zspage" which backs
the slab.  This allows for higher allocation success rate under memory
pressure.

Also, zsmalloc allows objects to span page boundaries within the zspage.
This allows for lower fragmentation than could be had with the kernel
slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
kernel slab allocator, if a page compresses to 60% of its original size,
the memory savings gained through compression are lost in fragmentation
because another object of the same size can't be stored in the leftover
space.

This ability to span pages results in zsmalloc allocations not being
directly addressable by the user.  The user is given a
non-dereferenceable handle in response to an allocation request.  That
handle must be mapped, using zs_map_object(), which returns a pointer to
the mapped region that can be used.  The mapping is necessary since the
object data may reside in two different noncontiguous pages.
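
To illustrate why a mapping step is needed, here is a userspace sketch
in which an object straddling two separate pages is copied into one
contiguous buffer and written back on unmap.  The sk_* names and the
copy-based approach are assumptions for illustration; this is not the
zs_map_object() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_PAGE_SIZE 4096

/* Two pages that need not be adjacent in memory. */
struct sk_zspage { unsigned char *pages[2]; };

/* Bytes of the object that live in the first page. */
static size_t sk_first_part(size_t off, size_t len)
{
	size_t room = SK_PAGE_SIZE - off;

	return len < room ? len : room;
}

/* Copy an object that may cross the page boundary into buf. */
static void *sk_map(const struct sk_zspage *zs, size_t off, size_t len,
		    unsigned char *buf)
{
	size_t first = sk_first_part(off, len);

	assert(off < SK_PAGE_SIZE && len <= SK_PAGE_SIZE);
	memcpy(buf, zs->pages[0] + off, first);
	memcpy(buf + first, zs->pages[1], len - first);
	return buf;
}

/* Write the (possibly modified) contiguous copy back to both pages. */
static void sk_unmap(struct sk_zspage *zs, size_t off, size_t len,
		     const unsigned char *buf)
{
	size_t first = sk_first_part(off, len);

	assert(off < SK_PAGE_SIZE && len <= SK_PAGE_SIZE);
	memcpy(zs->pages[0] + off, buf, first);
	memcpy(zs->pages[1], buf + first, len - first);
}
```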

zsmalloc fulfills the allocation needs for zram perfectly.

[sjenning@linux.vnet.ibm.com: borrow Seth's quote]
Signed-off-by: Minchan Kim 
Acked-by: Nitin Gupta 
Reviewed-by: Konrad Rzeszutek Wilk 
Cc: Bob Liu 
Cc: Greg Kroah-Hartman 
Cc: Hugh Dickins 
Cc: Jens Axboe 
Cc: Luigi Semenzato 
Cc: Mel Gorman 
Cc: Pekka Enberg 
Cc: Rik van Riel 
Cc: Seth Jennings 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

list: add a new LRU list type

2013-09-10T22:56:30+00:00

Several subsystems use the same construct for LRU lists - a list head, a
spin lock and an item count.  They also use exactly the same code for
adding and removing items from the LRU.  Create a generic type for these
LRU lists.
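
The shared construct (list head, lock, item count) might look like this
in a userspace sketch.  All sk_* names are hypothetical, and a C11
atomic_flag stands in for the kernel spin lock; this is not the
interface added by the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_node { struct sk_node *prev, *next; };

struct sk_lru {
	atomic_flag lock;	/* stand-in for a spin lock */
	struct sk_node head;	/* circular list; head.next is the oldest */
	long nr_items;
};

/* A self-linked node means "not on any list". */
static void sk_node_init(struct sk_node *n)
{
	n->prev = n->next = n;
}

static void sk_lru_init(struct sk_lru *l)
{
	atomic_flag_clear(&l->lock);
	sk_node_init(&l->head);
	l->nr_items = 0;
}

static void sk_lock(struct sk_lru *l)
{
	while (atomic_flag_test_and_set(&l->lock))
		;	/* spin until the flag was previously clear */
}

static void sk_unlock(struct sk_lru *l)
{
	atomic_flag_clear(&l->lock);
}

/* Add at the tail; returns false if the item is already on a list. */
static bool sk_lru_add(struct sk_lru *l, struct sk_node *n)
{
	bool added = false;

	sk_lock(l);
	if (n->next == n) {
		n->prev = l->head.prev;
		n->next = &l->head;
		l->head.prev->next = n;
		l->head.prev = n;
		l->nr_items++;
		added = true;
	}
	sk_unlock(l);
	return added;
}

/* Remove; returns false if the item was not on a list. */
static bool sk_lru_del(struct sk_lru *l, struct sk_node *n)
{
	bool removed = false;

	sk_lock(l);
	if (n->next != n) {
		n->prev->next = n->next;
		n->next->prev = n->prev;
		sk_node_init(n);
		l->nr_items--;
		removed = true;
	}
	sk_unlock(l);
	return removed;
}
```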

This is the beginning of generic, node aware LRUs for shrinkers to work
with.

[glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
Signed-off-by: Dave Chinner 
Signed-off-by: Glauber Costa 
Reviewed-by: Greg Thelen 
Acked-by: Mel Gorman 
Cc: "Theodore Ts'o" 
Cc: Adrian Hunter 
Cc: Al Viro 
Cc: Artem Bityutskiy 
Cc: Arve Hjønnevåg 
Cc: Carlos Maiolino 
Cc: Christoph Hellwig 
Cc: Chuck Lever 
Cc: Daniel Vetter 
Cc: David Rientjes 
Cc: Gleb Natapov 
Cc: Greg Thelen 
Cc: J. Bruce Fields 
Cc: Jan Kara 
Cc: Jerome Glisse 
Cc: John Stultz 
Cc: KAMEZAWA Hiroyuki 
Cc: Kent Overstreet 
Cc: Kirill A. Shutemov 
Cc: Marcelo Tosatti 
Cc: Mel Gorman 
Cc: Steven Whitehouse 
Cc: Thomas Hellstrom 
Cc: Trond Myklebust 
Signed-off-by: Andrew Morton 

Signed-off-by: Al Viro

zswap: add to mm/

2013-07-11T01:11:34+00:00

zswap is a thin backend for frontswap that takes pages that are in the
process of being swapped out and attempts to compress them and store
them in a RAM-based memory pool.  This can result in a significant I/O
reduction on the swap device and, in the case where decompressing from
RAM is faster than reading from the swap device, can also improve
workload performance.

It also has support for evicting swap pages that are currently
compressed in zswap to the swap device on an LRU(ish) basis.  This
functionality makes zswap a true cache in that, once the cache is full,
the oldest pages can be moved out of zswap to the swap device so newer
pages can be compressed and stored in zswap.
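
A toy model of the "evict the oldest entry when full" behaviour.  The
sk_* names and the three-slot capacity are assumptions, compression is
not modelled, and writeback to the swap device is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

#define SK_POOL_SLOTS 3		/* assumed tiny capacity for this sketch */

struct sk_pool {
	unsigned long entries[SK_POOL_SLOTS];	/* swap offsets, oldest first */
	size_t used;
	unsigned long written_back;	/* entries pushed out to the swap device */
	unsigned long last_written;	/* offset of the most recent writeback */
};

/* Store a page; if the pool is full, first move the oldest entry out. */
static void sk_store(struct sk_pool *p, unsigned long offset)
{
	if (p->used == SK_POOL_SLOTS) {
		p->last_written = p->entries[0];
		p->written_back++;
		/* Shift the remaining entries so the oldest stays first. */
		for (size_t i = 1; i < p->used; i++)
			p->entries[i - 1] = p->entries[i];
		p->used--;
	}
	p->entries[p->used++] = offset;
}
```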

This patch adds the zswap driver to mm/

Signed-off-by: Seth Jennings 
Acked-by: Rik van Riel 
Cc: Greg Kroah-Hartman 
Cc: Nitin Gupta 
Cc: Minchan Kim 
Cc: Konrad Rzeszutek Wilk 
Cc: Dan Magenheimer 
Cc: Robert Jennings 
Cc: Jenifer Hopper 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Cc: Larry Woodman 
Cc: Benjamin Herrenschmidt 
Cc: Dave Hansen 
Cc: Joe Perches 
Cc: Joonsoo Kim 
Cc: Cody P Schafer 
Cc: Hugh Dickins 
Cc: Paul Mackerras 
Cc: Fengguang Wu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zbud: add to mm/

2013-07-11T01:11:34+00:00

zbud is a special-purpose allocator for storing compressed pages.  It
is designed to store up to two compressed pages per physical page.
While this design limits storage density, it has simple and
deterministic reclaim properties that make it preferable to a higher
density approach when reclaim will be used.

zbud works by storing compressed pages, or "zpages", together in pairs
in a single memory page called a "zbud page".  The first buddy is "left
justified" at the beginning of the zbud page, and the last buddy is
"right justified" at the end of the zbud page.  The benefit is that if
either buddy is freed, the freed buddy space, coalesced with whatever
slack space that existed between the buddies, results in the largest
possible free region within the zbud page.
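
The layout can be sketched with two sizes per zbud page.  The sk_*
names are hypothetical, and any per-page header or allocation
granularity a real implementation would need is ignored here.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096

struct sk_zbud_page {
	size_t first_size;	/* left-justified buddy, 0 if free */
	size_t last_size;	/* right-justified buddy, 0 if free */
};

/* The only free space is the single gap between the two buddies. */
static size_t sk_free_bytes(const struct sk_zbud_page *zp)
{
	return SK_PAGE_SIZE - zp->first_size - zp->last_size;
}

/* The first buddy always starts at the beginning of the page. */
static size_t sk_first_offset(void)
{
	return 0;
}

/* The last buddy always ends at the end of the page. */
static size_t sk_last_offset(const struct sk_zbud_page *zp)
{
	return SK_PAGE_SIZE - zp->last_size;
}
```

Because the buddies sit at opposite ends, freeing either one merges its
space with the existing gap instead of leaving two separate holes.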

zbud also provides an attractive lower bound on density.  The ratio of
zpages to zbud pages can not be less than 1.  This ensures that zbud can
never "do harm" by using more pages to store zpages than the
uncompressed zpages would have used on their own.

This implementation is a rewrite of the zbud allocator internally used
by zcache in the drivers/staging tree.  The rewrite was necessary to
remove some of the zcache specific elements that were ingrained
throughout and provide a generic allocation interface that can later be
used by zsmalloc and others.

This patch adds zbud to mm/ for later use by zswap.

Signed-off-by: Seth Jennings 
Acked-by: Rik van Riel 
Cc: Greg Kroah-Hartman 
Cc: Nitin Gupta 
Cc: Minchan Kim 
Cc: Konrad Rzeszutek Wilk 
Cc: Dan Magenheimer 
Cc: Robert Jennings 
Cc: Jenifer Hopper 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Cc: Larry Woodman 
Cc: Benjamin Herrenschmidt 
Cc: Dave Hansen 
Cc: Joe Perches 
Cc: Joonsoo Kim 
Cc: Cody P Schafer 
Cc: Hugh Dickins 
Cc: Paul Mackerras 
Cc: Bob Liu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: add memory.pressure_level events

2013-04-29T22:54:38+00:00

With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications.  The levels are defined like this:

The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaiming activity might be useful for
maintaining cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shut down unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure; the system might be swapping, paging out active file
caches, etc.  Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing; it is
about to run out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e.  the
events are not pass-through.  Here is what this means: for example you
have three cgroups: A->B->C.  Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure.  In
this situation, only group C will receive the notification, i.e.  groups
A and B will not receive it.  This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.  So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why you would need them.)
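
The A->B->C example can be modelled as a walk up the hierarchy that
stops at the first listener.  The struct and function names below are
hypothetical and only illustrate the propagation rule, not the memcg
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_cgroup {
	struct sk_cgroup *parent;	/* NULL for the top-level group */
	bool has_listener;
	int notified;			/* events delivered to this group */
};

/*
 * Walk upward from the group under pressure.  The first group with a
 * listener handles the event and propagation stops there.  Returns the
 * group that was notified, or NULL if nobody listens.
 */
static struct sk_cgroup *sk_pressure_event(struct sk_cgroup *cg)
{
	for (; cg; cg = cg->parent) {
		if (cg->has_listener) {
			cg->notified++;
			return cg;
		}
	}
	return NULL;
}
```

With listeners on all three groups, an event raised in C stops at C.  If
C had no listener, the same event would be delivered to B and still
never reach A.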

Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much bookkeeping, in contrast to the
rest of memcg features.  Unfortunately, as of the current memcg
implementation, page accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
option, so it will not require any changes on the userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
Signed-off-by: Anton Vorontsov 
Acked-by: Kirill A. Shutemov 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Tejun Heo 
Cc: David Rientjes 
Cc: Pekka Enberg 
Cc: Mel Gorman 
Cc: Glauber Costa 
Cc: Michal Hocko 
Cc: Luiz Capitulino 
Cc: Greg Thelen 
Cc: Leonid Moiseichuk 
Cc: KOSAKI Motohiro 
Cc: Minchan Kim 
Cc: Bartlomiej Zolnierkiewicz 
Cc: John Stultz 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds