linux-stable.git/include/linux/swap.h, branch v3.11.2

swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES

2013-07-03T23:07:32+00:00

Considering the use cases where the swap device supports discard:
a) and can do it quickly;
b) but it's slow to do in small granularities (or concurrent with other
   I/O);
c) but the implementation is so horrendous that you don't even want to
   send one down;

And assuming that the sysadmin considers it useful to send the discards down
at all, we would (probably) want the following solutions:

  i. do the fine-grained discards for freed swap pages, if device is
     capable of doing so optimally;
 ii. do single-time (batched) swap area discards, either at swapon
     or via something like fstrim (not implemented yet);
iii. allow doing both single-time and fine-grained discards; or
 iv. turn it off completely (default behavior)

As implemented today, one can only enable/disable discards for swap, but
one cannot select, for instance, solution (ii) on a swap device like (b)
even though the single-time discard is regarded to be interesting, or
necessary to the workload because it would imply (1), and the device is
not capable of performing it optimally.

This patch addresses the scenario depicted above by introducing a way to
ensure the (probably) wanted solutions (i, ii, iii and iv) can be flexibly
flagged through swapon(8) to allow a sysadmin to select the best suitable
swap discard policy accordingly to system constraints.

This patch introduces SWAP_FLAG_DISCARD_PAGES and SWAP_FLAG_DISCARD_ONCE
new flags to allow more flexibe swap discard policies being flagged
through swapon(8).  The default behavior is to keep both single-time, or
batched, area discards (SWAP_FLAG_DISCARD_ONCE) and fine-grained discards
for page-clusters (SWAP_FLAG_DISCARD_PAGES) enabled, in order to keep
consistentcy with older kernel behavior, as well as maintain compatibility
with older swapon(8).  However, through the new introduced flags the best
suitable discard policy can be selected accordingly to any given swap
device constraint.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Rafael Aquini 
Acked-by: KOSAKI Motohiro 
Cc: Hugh Dickins 
Cc: Shaohua Li 
Cc: Karel Zak 
Cc: Jeff Moyer 
Cc: Rik van Riel 
Cc: Larry Woodman 
Cc: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru

2013-07-03T23:07:31+00:00

Similar to __pagevec_lru_add, this patch removes the LRU parameter from
__lru_cache_add and lru_cache_add_lru as the caller does not control the
exact LRU the page gets added to.  lru_cache_add_lru gets renamed to
lru_cache_add the name is silly without the lru parameter.  With the
parameter removed, it is required that the caller indicate if they want
the page added to the active or inactive list by setting or clearing
PageActive respectively.

[akpm@linux-foundation.org: Suggested the patch]
[gang.chen@asianux.com: fix used-unintialized warning]
Signed-off-by: Mel Gorman 
Signed-off-by: Chen Gang 
Cc: Jan Kara 
Cc: Rik van Riel 
Acked-by: Johannes Weiner 
Cc: Alexey Lyahkov 
Cc: Andrew Perepechko 
Cc: Robin Dong 
Cc: Theodore Tso 
Cc: Hugh Dickins 
Cc: Rik van Riel 
Cc: Bernd Schubert 
Cc: David Howells 
Cc: Trond Myklebust 
Cc: Mel Gorman 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: thp: add split tail pages to shrink page list in page reclaim

2013-04-29T22:54:38+00:00

In page reclaim, huge page is split.  split_huge_page() adds tail pages
to LRU list.  Since we are reclaiming a huge page, it's better we
reclaim all subpages of the huge page instead of just the head page.
This patch adds split tail pages to shrink page list so the tail pages
can be reclaimed soon.

Before this patch, run a swap workload:
  thp_fault_alloc 3492
  thp_fault_fallback 608
  thp_collapse_alloc 6
  thp_collapse_alloc_failed 0
  thp_split 916

With this patch:
  thp_fault_alloc 4085
  thp_fault_fallback 16
  thp_collapse_alloc 90
  thp_collapse_alloc_failed 0
  thp_split 1272

fallback allocation is reduced a lot.

[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
Signed-off-by: Shaohua Li 
Acked-by: Rik van Riel 
Acked-by: Minchan Kim 
Acked-by: Johannes Weiner 
Reviewed-by: Wanpeng Li 
Cc: Andrea Arcangeli 
Cc: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: allow for outstanding swap writeback accounting

2013-04-29T22:54:38+00:00

To prevent flooding the swap device with writebacks, frontswap backends
need to count and limit the number of outstanding writebacks.  The
incrementing of the counter can be done before the call to
__swap_writepage().  However, the caller must receive a notification
when the writeback completes in order to decrement the counter.

To achieve this functionality, this patch modifies __swap_writepage() to
take the bio completion callback function as an argument.

end_swap_bio_write(), the normal bio completion function, is also made
non-static so that code doing the accounting can call it after the
accounting is done.

There should be no behavioural change to existing code.

Signed-off-by: Seth Jennings 
Signed-off-by: Bob Liu 
Acked-by: Minchan Kim 
Reviewed-by: Dan Magenheimer 
Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: break up swap_writepage() for frontswap backends

2013-04-29T22:54:38+00:00

swap_writepage() is currently where frontswap hooks into the swap write
path to capture pages with the frontswap_store() function.  However, if
a frontswap backend wants to "resume" the writeback of a page to the
swap device, it can't call swap_writepage() as the page will simply
reenter the backend.

This patch separates swap_writepage() into a top and bottom half, the
bottom half named __swap_writepage() to allow a frontswap backend, like
zswap, to resume writeback beyond the frontswap_store() hook.

__add_to_swap_cache() is also made non-static so that the page for which
writeback is to be resumed can be added to the swap cache.

Signed-off-by: Seth Jennings 
Signed-off-by: Bob Liu 
Acked-by: Minchan Kim 
Reviewed-by: Dan Magenheimer 
Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vmscan: change type of vm_total_pages to unsigned long

2013-02-24T01:50:22+00:00

This variable is calculated from nr_free_pagecache_pages so
change its type to unsigned long.

Signed-off-by: Zhang Yanfei 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: fix return type for functions nr_free_*_pages

2013-02-24T01:50:21+00:00

Currently, the amount of RAM that functions nr_free_*_pages return is
held in unsigned int.  But in machines with big memory (exceeding 16TB),
the amount may be incorrect because of overflow, so fix it.

Signed-off-by: Zhang Yanfei 
Cc: Simon Horman 
Cc: Julian Anastasov 
Cc: David Miller 
Cc: Eric Van Hensbergen 
Cc: Ron Minnich 
Cc: Latchesar Ionkov 
Cc: Mel Gorman 
Cc: Minchan Kim 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

swap: add per-partition lock for swapfile

2013-02-24T01:50:17+00:00

swap_lock is heavily contended when I test swap to 3 fast SSD (even
slightly slower than swap to 2 such SSD).  The main contention comes
from swap_info_get().  This patch tries to fix the gap with adding a new
per-partition lock.

Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.

nr_swap_pages is an atomic now, it can be changed without swap_lock.  In
theory, it's possible get_swap_page() finds no swap pages but actually
there are free swap pages.  But sounds not a big problem.

Accessing partition specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.

Changing swap_info_struct.flags need hold swap_lock and
swap_info_struct.lock, because scan_scan_map() will check it.  read the
flags is ok with either the locks hold.

If both swap_lock and swap_info_struct.lock must be hold, we always hold
the former first to avoid deadlock.

swap_entry_free() can change swap_list.  To delete that code, we add a
new highest_priority_index.  Whenever get_swap_page() is called, we
check it.  If it's valid, we use it.

It's a pity get_swap_page() still holds swap_lock().  But in practice,
swap_lock() isn't heavily contended in my test with this patch (or I can
say there are other much more heavier bottlenecks like TLB flush).  And
BTW, looks get_swap_page() doesn't really need the lock.  We never free
swap_info[] and we check SWAP_WRITEOK flag.  The only risk without the
lock is we could swapout to some low priority swap, but we can quickly
recover after several rounds of swap, so sounds not a big deal to me.
But I'd prefer to fix this if it's a real problem.

"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s.  This patch further improves the
speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
so TLB flush isn't the biggest bottleneck before the patches.

[arnd@arndb.de: fix it for nommu]
[hughd@google.com: add missing unlock]
[minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
Signed-off-by: Shaohua Li 
Cc: Hugh Dickins 
Cc: Rik van Riel 
Cc: Minchan Kim 
Cc: Greg Kroah-Hartman 
Cc: Seth Jennings 
Cc: Konrad Rzeszutek Wilk 
Cc: Xiao Guangrong 
Cc: Dan Magenheimer 
Cc: Stephen Rothwell 
Signed-off-by: Arnd Bergmann 
Signed-off-by: Hugh Dickins 
Signed-off-by: Minchan Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

swap: make each swap partition have one address_space

2013-02-24T01:50:17+00:00

When I use several fast SSD to do swap, swapper_space.tree_lock is
heavily contended.  This makes each swap partition have one
address_space to reduce the lock contention.  There is an array of
address_space for swap.  The swap entry type is the index to the array.

In my test with 3 SSD, this increases the swapout throughput 20%.

[akpm@linux-foundation.org: revert unneeded change to  __add_to_swap_cache]
Signed-off-by: Shaohua Li 
Cc: Hugh Dickins 
Acked-by: Rik van Riel 
Acked-by: Minchan Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: vmscan: save work scanning (almost) empty LRU lists

2013-02-24T01:50:09+00:00

In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
amount of pages is scanned from the LRU lists on each iteration, to make
progress.

Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages
that are not there.

Empty LRU lists are quite common with memory cgroups in NUMA
environments because there exists a set of LRU lists for each zone for
each memory cgroup, while the memory of a single cgroup is expected to
stay on just one node.  The number of expected empty LRU lists is thus

  memcgs * (nodes - 1) * lru types

Each attempt to reclaim from an empty LRU list does expensive size
comparisons between lists, acquires the zone's lru lock etc.  Avoid
that.

Signed-off-by: Johannes Weiner 
Reviewed-by: Rik van Riel 
Acked-by: Mel Gorman 
Reviewed-by: Michal Hocko 
Cc: Hugh Dickins 
Cc: Satoru Moriya 
Cc: Simon Jeons 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds