<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/mm/Makefile, branch v3.15</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2014-04-12T21:49:50+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2014-04-12T21:49:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5166701b368caea89d57b14bf41cf39e819dad51'/>
<id>5166701b368caea89d57b14bf41cf39e819dad51</id>
<content type='text'>
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe-&gt;buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe-&gt;buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: create generic early_ioremap() support</title>
<updated>2014-04-07T23:36:15+00:00</updated>
<author>
<name>Mark Salter</name>
<email>msalter@redhat.com</email>
</author>
<published>2014-04-07T22:39:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9e5c33d7aeeef62e5fa7e74f94432685bd03026b'/>
<id>9e5c33d7aeeef62e5fa7e74f94432685bd03026b</id>
<content type='text'>
This patch creates a generic implementation of early_ioremap() support
based on the existing x86 implementation.  early_ioremp() is useful for
early boot code which needs to temporarily map I/O or memory regions
before normal mapping functions such as ioremap() are available.

Some architectures have optional MMU.  In the no-MMU case, the remap
functions simply return the passed in physical address and the unmap
functions do nothing.

Signed-off-by: Mark Salter &lt;msalter@redhat.com&gt;
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Acked-by: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Borislav Petkov &lt;borislav.petkov@amd.com&gt;
Cc: Dave Young &lt;dyoung@redhat.com&gt;
Cc: Will Deacon &lt;will.deacon@arm.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch creates a generic implementation of early_ioremap() support
based on the existing x86 implementation.  early_ioremp() is useful for
early boot code which needs to temporarily map I/O or memory regions
before normal mapping functions such as ioremap() are available.

Some architectures have optional MMU.  In the no-MMU case, the remap
functions simply return the passed in physical address and the unmap
functions do nothing.

Signed-off-by: Mark Salter &lt;msalter@redhat.com&gt;
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Acked-by: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Borislav Petkov &lt;borislav.petkov@amd.com&gt;
Cc: Dave Young &lt;dyoung@redhat.com&gt;
Cc: Will Deacon &lt;will.deacon@arm.com&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: per-thread vma caching</title>
<updated>2014-04-07T23:35:53+00:00</updated>
<author>
<name>Davidlohr Bueso</name>
<email>davidlohr@hp.com</email>
</author>
<published>2014-04-07T22:37:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=615d6e8756c87149f2d4c1b93d471bca002bd849'/>
<id>615d6e8756c87149f2d4c1b93d471bca002bd849</id>
<content type='text'>
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme does improve ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The amounts of cycles can fluctuate between
   anywhere from ~60 to ~116 for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Reviewed-by: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme does improve ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The amounts of cycles can fluctuate between
   anywhere from ~60 to ~116 for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso &lt;davidlohr@hp.com&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Acked-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Reviewed-by: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: thrash detection-based file cache sizing</title>
<updated>2014-04-03T23:21:01+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2014-04-03T21:47:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a528910e12ec7ee203095eb1711468a66b9b60b0'/>
<id>a528910e12ec7ee203095eb1711468a66b9b60b0</id>
<content type='text'>
The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past.  We call the recently usedbut
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Reviewed-by: Bob Liu &lt;bob.liu@oracle.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Luigi Semenzato &lt;semenzato@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Metin Doslu &lt;metin@citusdata.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Ozgun Erdogan &lt;ozgun@citusdata.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Roman Gushchin &lt;klamm@yandex-team.ru&gt;
Cc: Ryan Mallon &lt;rmallon@gmail.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past.  We call the recently usedbut
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan@kernel.org&gt;
Reviewed-by: Bob Liu &lt;bob.liu@oracle.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Luigi Semenzato &lt;semenzato@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Metin Doslu &lt;metin@citusdata.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Ozgun Erdogan &lt;ozgun@citusdata.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Roman Gushchin &lt;klamm@yandex-team.ru&gt;
Cc: Ryan Mallon &lt;rmallon@gmail.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>take iov_iter stuff to mm/iov_iter.c</title>
<updated>2014-04-02T03:19:30+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2014-02-06T00:11:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4f18cd317a118c28482f97303600a2fe2ada6c79'/>
<id>4f18cd317a118c28482f97303600a2fe2ada6c79</id>
<content type='text'>
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>zsmalloc: move it under mm</title>
<updated>2014-01-31T00:56:55+00:00</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2014-01-30T23:45:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bcf1647d0899666f0fb90d176abf63bae22abb7c'/>
<id>bcf1647d0899666f0fb90d176abf63bae22abb7c</id>
<content type='text'>
This patch moves zsmalloc under mm directory.

Before that, description will explain why we have needed custom
allocator.

Zsmalloc is a new slab-based memory allocator for storing compressed
pages.  It is designed for low fragmentation and high allocation success
rate on large object, but &lt;= PAGE_SIZE allocations.

zsmalloc differs from the kernel slab allocator in two primary ways to
achieve these design goals.

zsmalloc never requires high order page allocations to back slabs, or
"size classes" in zsmalloc terms.  Instead it allows multiple
single-order pages to be stitched together into a "zspage" which backs
the slab.  This allows for higher allocation success rate under memory
pressure.

Also, zsmalloc allows objects to span page boundaries within the zspage.
This allows for lower fragmentation than could be had with the kernel
slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
kernel slab allocator, if a page compresses to 60% of it original size,
the memory savings gained through compression is lost in fragmentation
because another object of the same size can't be stored in the leftover
space.

This ability to span pages results in zsmalloc allocations not being
directly addressable by the user.  The user is given an
non-dereferencable handle in response to an allocation request.  That
handle must be mapped, using zs_map_object(), which returns a pointer to
the mapped region that can be used.  The mapping is necessary since the
object data may reside in two different noncontigious pages.

The zsmalloc fulfills the allocation needs for zram perfectly

[sjenning@linux.vnet.ibm.com: borrow Seth's quote]
Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Acked-by: Nitin Gupta &lt;ngupta@vflare.org&gt;
Reviewed-by: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Bob Liu &lt;bob.liu@oracle.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Luigi Semenzato &lt;semenzato@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Seth Jennings &lt;sjenning@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This patch moves zsmalloc under mm directory.

Before that, description will explain why we have needed custom
allocator.

Zsmalloc is a new slab-based memory allocator for storing compressed
pages.  It is designed for low fragmentation and high allocation success
rate on large object, but &lt;= PAGE_SIZE allocations.

zsmalloc differs from the kernel slab allocator in two primary ways to
achieve these design goals.

zsmalloc never requires high order page allocations to back slabs, or
"size classes" in zsmalloc terms.  Instead it allows multiple
single-order pages to be stitched together into a "zspage" which backs
the slab.  This allows for higher allocation success rate under memory
pressure.

Also, zsmalloc allows objects to span page boundaries within the zspage.
This allows for lower fragmentation than could be had with the kernel
slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
kernel slab allocator, if a page compresses to 60% of it original size,
the memory savings gained through compression is lost in fragmentation
because another object of the same size can't be stored in the leftover
space.

This ability to span pages results in zsmalloc allocations not being
directly addressable by the user.  The user is given an
non-dereferencable handle in response to an allocation request.  That
handle must be mapped, using zs_map_object(), which returns a pointer to
the mapped region that can be used.  The mapping is necessary since the
object data may reside in two different noncontigious pages.

The zsmalloc fulfills the allocation needs for zram perfectly

[sjenning@linux.vnet.ibm.com: borrow Seth's quote]
Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Acked-by: Nitin Gupta &lt;ngupta@vflare.org&gt;
Reviewed-by: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Bob Liu &lt;bob.liu@oracle.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Luigi Semenzato &lt;semenzato@google.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Seth Jennings &lt;sjenning@linux.vnet.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>list: add a new LRU list type</title>
<updated>2013-09-10T22:56:30+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2013-08-28T00:17:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a38e40824844a5ec85f3ea95632be953477d2afa'/>
<id>a38e40824844a5ec85f3ea95632be953477d2afa</id>
<content type='text'>
Several subsystems use the same construct for LRU lists - a list head, a
spin lock and and item count.  They also use exactly the same code for
adding and removing items from the LRU.  Create a generic type for these
LRU lists.

This is the beginning of generic, node aware LRUs for shrinkers to work
with.

[glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Signed-off-by: Glauber Costa &lt;glommer@openvz.org&gt;
Reviewed-by: Greg Thelen &lt;gthelen@google.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Cc: Adrian Hunter &lt;adrian.hunter@intel.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Artem Bityutskiy &lt;artem.bityutskiy@linux.intel.com&gt;
Cc: Arve Hjønnevåg &lt;arve@android.com&gt;
Cc: Carlos Maiolino &lt;cmaiolino@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Gleb Natapov &lt;gleb@redhat.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: J. Bruce Fields &lt;bfields@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Kent Overstreet &lt;koverstreet@google.com&gt;
Cc: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
Cc: Thomas Hellstrom &lt;thellstrom@vmware.com&gt;
Cc: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Several subsystems use the same construct for LRU lists - a list head, a
spin lock and and item count.  They also use exactly the same code for
adding and removing items from the LRU.  Create a generic type for these
LRU lists.

This is the beginning of generic, node aware LRUs for shrinkers to work
with.

[glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Signed-off-by: Glauber Costa &lt;glommer@openvz.org&gt;
Reviewed-by: Greg Thelen &lt;gthelen@google.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Cc: Adrian Hunter &lt;adrian.hunter@intel.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Artem Bityutskiy &lt;artem.bityutskiy@linux.intel.com&gt;
Cc: Arve Hjønnevåg &lt;arve@android.com&gt;
Cc: Carlos Maiolino &lt;cmaiolino@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Cc: Daniel Vetter &lt;daniel.vetter@ffwll.ch&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Gleb Natapov &lt;gleb@redhat.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: J. Bruce Fields &lt;bfields@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jerome Glisse &lt;jglisse@redhat.com&gt;
Cc: John Stultz &lt;john.stultz@linaro.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Kent Overstreet &lt;koverstreet@google.com&gt;
Cc: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Steven Whitehouse &lt;swhiteho@redhat.com&gt;
Cc: Thomas Hellstrom &lt;thellstrom@vmware.com&gt;
Cc: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>zswap: add to mm/</title>
<updated>2013-07-11T01:11:34+00:00</updated>
<author>
<name>Seth Jennings</name>
<email>sjenning@linux.vnet.ibm.com</email>
</author>
<published>2013-07-10T23:05:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2b2811178e85553405b86e3fe78357b9b95889ce'/>
<id>2b2811178e85553405b86e3fe78357b9b95889ce</id>
<content type='text'>
zswap is a thin backend for frontswap that takes pages that are in the
process of being swapped out and attempts to compress them and store
them in a RAM-based memory pool.  This can result in a significant I/O
reduction on the swap device and, in the case where decompressing from
RAM is faster than reading from the swap device, can also improve
workload performance.

It also has support for evicting swap pages that are currently
compressed in zswap to the swap device on an LRU(ish) basis.  This
functionality makes zswap a true cache in that, once the cache is full,
the oldest pages can be moved out of zswap to the swap device so newer
pages can be compressed and stored in zswap.

This patch adds the zswap driver to mm/

Signed-off-by: Seth Jennings &lt;sjenning@linux.vnet.ibm.com&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Nitin Gupta &lt;ngupta@vflare.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Dan Magenheimer &lt;dan.magenheimer@oracle.com&gt;
Cc: Robert Jennings &lt;rcj@linux.vnet.ibm.com&gt;
Cc: Jenifer Hopper &lt;jhopper@us.ibm.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Dave Hansen &lt;dave@sr71.net&gt;
Cc: Joe Perches &lt;joe@perches.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Cody P Schafer &lt;cody@linux.vnet.ibm.com&gt;
Cc: Hugh Dickens &lt;hughd@google.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
zswap is a thin backend for frontswap that takes pages that are in the
process of being swapped out and attempts to compress them and store
them in a RAM-based memory pool.  This can result in a significant I/O
reduction on the swap device and, in the case where decompressing from
RAM is faster than reading from the swap device, can also improve
workload performance.

It also has support for evicting swap pages that are currently
compressed in zswap to the swap device on an LRU(ish) basis.  This
functionality makes zswap a true cache in that, once the cache is full,
the oldest pages can be moved out of zswap to the swap device so newer
pages can be compressed and stored in zswap.

This patch adds the zswap driver to mm/

Signed-off-by: Seth Jennings &lt;sjenning@linux.vnet.ibm.com&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Nitin Gupta &lt;ngupta@vflare.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Dan Magenheimer &lt;dan.magenheimer@oracle.com&gt;
Cc: Robert Jennings &lt;rcj@linux.vnet.ibm.com&gt;
Cc: Jenifer Hopper &lt;jhopper@us.ibm.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Dave Hansen &lt;dave@sr71.net&gt;
Cc: Joe Perches &lt;joe@perches.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Cody P Schafer &lt;cody@linux.vnet.ibm.com&gt;
Cc: Hugh Dickens &lt;hughd@google.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>zbud: add to mm/</title>
<updated>2013-07-11T01:11:34+00:00</updated>
<author>
<name>Seth Jennings</name>
<email>sjenning@linux.vnet.ibm.com</email>
</author>
<published>2013-07-10T23:04:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4e2e2770b1529edc5849c86b29a6febe27e2f083'/>
<id>4e2e2770b1529edc5849c86b29a6febe27e2f083</id>
<content type='text'>
zbud is an special purpose allocator for storing compressed pages.  It
is designed to store up to two compressed pages per physical page.
While this design limits storage density, it has simple and
deterministic reclaim properties that make it preferable to a higher
density approach when reclaim will be used.

zbud works by storing compressed pages, or "zpages", together in pairs
in a single memory page called a "zbud page".  The first buddy is "left
justifed" at the beginning of the zbud page, and the last buddy is
"right justified" at the end of the zbud page.  The benefit is that if
either buddy is freed, the freed buddy space, coalesced with whatever
slack space that existed between the buddies, results in the largest
possible free region within the zbud page.

zbud also provides an attractive lower bound on density.  The ratio of
zpages to zbud pages can not be less than 1.  This ensures that zbud can
never "do harm" by using more pages to store zpages than the
uncompressed zpages would have used on their own.

This implementation is a rewrite of the zbud allocator internally used
by zcache in the driver/staging tree.  The rewrite was necessary to
remove some of the zcache specific elements that were ingrained
throughout and provide a generic allocation interface that can later be
used by zsmalloc and others.

This patch adds zbud to mm/ for later use by zswap.

Signed-off-by: Seth Jennings &lt;sjenning@linux.vnet.ibm.com&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Nitin Gupta &lt;ngupta@vflare.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Dan Magenheimer &lt;dan.magenheimer@oracle.com&gt;
Cc: Robert Jennings &lt;rcj@linux.vnet.ibm.com&gt;
Cc: Jenifer Hopper &lt;jhopper@us.ibm.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Dave Hansen &lt;dave@sr71.net&gt;
Cc: Joe Perches &lt;joe@perches.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Cody P Schafer &lt;cody@linux.vnet.ibm.com&gt;
Cc: Hugh Dickens &lt;hughd@google.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Bob Liu &lt;bob.liu@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
zbud is an special purpose allocator for storing compressed pages.  It
is designed to store up to two compressed pages per physical page.
While this design limits storage density, it has simple and
deterministic reclaim properties that make it preferable to a higher
density approach when reclaim will be used.

zbud works by storing compressed pages, or "zpages", together in pairs
in a single memory page called a "zbud page".  The first buddy is "left
justifed" at the beginning of the zbud page, and the last buddy is
"right justified" at the end of the zbud page.  The benefit is that if
either buddy is freed, the freed buddy space, coalesced with whatever
slack space that existed between the buddies, results in the largest
possible free region within the zbud page.

zbud also provides an attractive lower bound on density.  The ratio of
zpages to zbud pages can not be less than 1.  This ensures that zbud can
never "do harm" by using more pages to store zpages than the
uncompressed zpages would have used on their own.

This implementation is a rewrite of the zbud allocator internally used
by zcache in the driver/staging tree.  The rewrite was necessary to
remove some of the zcache specific elements that were ingrained
throughout and provide a generic allocation interface that can later be
used by zsmalloc and others.

This patch adds zbud to mm/ for later use by zswap.

Signed-off-by: Seth Jennings &lt;sjenning@linux.vnet.ibm.com&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Nitin Gupta &lt;ngupta@vflare.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Konrad Rzeszutek Wilk &lt;konrad.wilk@oracle.com&gt;
Cc: Dan Magenheimer &lt;dan.magenheimer@oracle.com&gt;
Cc: Robert Jennings &lt;rcj@linux.vnet.ibm.com&gt;
Cc: Jenifer Hopper &lt;jhopper@us.ibm.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Dave Hansen &lt;dave@sr71.net&gt;
Cc: Joe Perches &lt;joe@perches.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Cody P Schafer &lt;cody@linux.vnet.ibm.com&gt;
Cc: Hugh Dickens &lt;hughd@google.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Bob Liu &lt;bob.liu@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>memcg: add memory.pressure_level events</title>
<updated>2013-04-29T22:54:38+00:00</updated>
<author>
<name>Anton Vorontsov</name>
<email>anton.vorontsov@linaro.org</email>
</author>
<published>2013-04-29T22:08:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=70ddf637eebe47e61fb2be08a59315581b6d2f38'/>
<id>70ddf637eebe47e61fb2be08a59315581b6d2f38</id>
<content type='text'>
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications.  The levels are defined like this:

The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaiming activity might be useful for
maintaining cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc.  Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e.  the
events are not pass-through.  Here is what this means: for example you
have three cgroups: A-&gt;B-&gt;C.  Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure.  In
this situation, only group C will receive the notification, i.e.  groups
A and B will not receive it.  This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.  So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)

Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features.  Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
option, so it will not require any changes on the userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov &lt;anton.vorontsov@linaro.org&gt;
Acked-by: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Acked-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Glauber Costa &lt;glommer@parallels.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Luiz Capitulino &lt;lcapitulino@redhat.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Leonid Moiseichuk &lt;leonid.moiseichuk@nokia.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@gmail.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Cc: John Stultz &lt;john.stultz@linaro.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications.  The levels are defined like this:

The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaiming activity might be useful for
maintaining cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc.  Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e.  the
events are not pass-through.  Here is what this means: for example you
have three cgroups: A-&gt;B-&gt;C.  Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure.  In
this situation, only group C will receive the notification, i.e.  groups
A and B will not receive it.  This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.  So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)

Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features.  Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
option, so it will not require any changes on the userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov &lt;anton.vorontsov@linaro.org&gt;
Acked-by: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Acked-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Glauber Costa &lt;glommer@parallels.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Luiz Capitulino &lt;lcapitulino@redhat.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Leonid Moiseichuk &lt;leonid.moiseichuk@nokia.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@gmail.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Bartlomiej Zolnierkiewicz &lt;b.zolnierkie@samsung.com&gt;
Cc: John Stultz &lt;john.stultz@linaro.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
