linux.git/mm/page-writeback.c, branch v3.9

Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

2013-02-28T21:21:44+00:00

Pull writeback fixes from Wu Fengguang:
 "Two writeback fixes

   - fix negative (setpoint - dirty) in 32bit archs

   - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()"

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  Negative (setpoint-dirty) in bdi_position_ratio()
  vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them

Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block

2013-02-28T20:52:24+00:00

Pull block IO core bits from Jens Axboe:
 "Below are the core block IO bits for 3.9.  It was delayed a few days
  since my workstation kept crashing every 2-8h after pulling it into
  current -git, but turns out it is a bug in the new pstate code (divide
  by zero, will report separately).  In any case, it contains:

   - The big cfq/blkcg update from Tejun and Vivek.

   - Additional block and writeback tracepoints from Tejun.

   - Improvement of the should sort (based on queues) logic in the plug
     flushing.

   - _io() variants of the wait_for_completion() interface, using
     io_schedule() instead of schedule() to contribute to io wait
     properly.

   - Various little fixes.

  You'll get two trivial merge conflicts, which should be easy enough to
  fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
  block: remove redundant check to bd_openers()
  block: use i_size_write() in bd_set_size()
  cfq: fix lock imbalance with failed allocations
  drivers/block/swim3.c: fix null pointer dereference
  block: don't select PERCPU_RWSEM
  block: account iowait time when waiting for completion of IO request
  sched: add wait_for_completion_io[_timeout]
  writeback: add more tracepoints
  block: add block_{touch|dirty}_buffer tracepoint
  buffer: make touch_buffer() an exported function
  block: add @req to bio_{front|back}_merge tracepoints
  block: add missing block_bio_complete() tracepoint
  block: Remove should_sort judgement when flush blk_plug
  block,elevator: use new hashtable implementation
  cfq-iosched: add hierarchical cfq_group statistics
  cfq-iosched: collect stats from dead cfqgs
  cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
  blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
  block: RCU free request_queue
  blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
  ...

page-writeback.c: subtract min_free_kbytes from dirtyable memory

2013-02-24T01:50:17+00:00

When calculating the amount of dirtyable memory, min_free_kbytes should be
subtracted because it is not intended for dirty pages.

Addresses http://bugs.debian.org/695182

[akpm@linux-foundation.org: fix up min_free_kbytes extern declarations]
[akpm@linux-foundation.org: fix min() warning]
Signed-off-by: Paul Szabo 
Acked-by: Rik van Riel 
Cc: Wu Fengguang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
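
A minimal user-space sketch of the idea in this commit: convert the
min_free_kbytes reserve from KiB to pages and take it out of the dirtyable
page count without letting the unsigned result wrap.  The 4 KiB page size
and the names SKETCH_PAGE_SHIFT, kbytes_to_pages() and
dirtyable_minus_reserve() are assumptions made for illustration; this is
not the kernel code itself.

```c
/* Assumed page size of 4 KiB (shift of 12); the real value is per-architecture. */
#define SKETCH_PAGE_SHIFT 12

/* Convert KiB to pages: one page holds 2^(SKETCH_PAGE_SHIFT - 10) KiB. */
static unsigned long kbytes_to_pages(unsigned long kbytes)
{
	return kbytes >> (SKETCH_PAGE_SHIFT - 10);
}

/*
 * Hypothetical helper: subtract the min_free_kbytes reserve from a
 * dirtyable page count, clamping at zero so the result cannot wrap.
 */
static unsigned long dirtyable_minus_reserve(unsigned long dirtyable_pages,
					     unsigned long min_free_kbytes)
{
	unsigned long reserve = kbytes_to_pages(min_free_kbytes);

	if (reserve > dirtyable_pages)
		reserve = dirtyable_pages;
	return dirtyable_pages - reserve;
}
```

With these assumptions, a 65536 KiB reserve corresponds to 16384 pages, so
100000 dirtyable pages become 83616.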

block: optionally snapshot page contents to provide stable pages during write

2013-02-22T01:22:20+00:00

This is a band-aid that provides stable page writes on jbd without
needing to backport the fixed locking and page writeback bit handling
schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
page contents instead of waiting.

For those wondering about the ext3 bandage -- fixing the jbd locking
(which was done as part of ext4dev years ago) is a lot of surgery, and
setting PG_writeback on data pages when we actually hold the page lock
dropped ext3 performance by nearly an order of magnitude.  If we're
going to migrate iscsi and raid to use stable page writes, the
complaints about high latency will likely return.  We might as well
centralize their page snapshotting thing to one place.

Signed-off-by: Darrick J. Wong 
Tested-by: Andy Lutomirski 
Cc: Adrian Hunter 
Cc: Artem Bityutskiy 
Reviewed-by: Jan Kara 
Cc: Joel Becker 
Cc: Mark Fasheh 
Cc: Steven Whitehouse 
Cc: Jens Axboe 
Cc: Eric Van Hensbergen 
Cc: Ron Minnich 
Cc: Latchesar Ionkov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
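
The snapshot approach described above can be illustrated in user space:
copy the page into a bounce buffer and issue the I/O from the copy, so the
original page may keep changing without affecting what is written.  This
is only a sketch; snapshot_page() and SKETCH_PAGE_SIZE are hypothetical
names, and the kernel's bounce-buffer machinery is considerably more
involved.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed page size for the sketch. */
#define SKETCH_PAGE_SIZE 4096

/*
 * Hypothetical helper: copy a page into a freshly allocated bounce
 * buffer.  The write would then use the copy, so later stores to the
 * original page cannot change the data in flight.
 * Returns NULL if allocation fails; the caller frees the buffer.
 */
static unsigned char *snapshot_page(const unsigned char *page)
{
	unsigned char *bounce = malloc(SKETCH_PAGE_SIZE);

	if (bounce)
		memcpy(bounce, page, SKETCH_PAGE_SIZE);
	return bounce;
}
```

The trade-off is one extra page copy per write in exchange for not
blocking the task that wants to modify the page.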

mm: only enforce stable page writes if the backing device requires it

2013-02-22T01:22:19+00:00

Create a helper function that checks whether a backing device requires
stable page writes and, if so, performs the necessary wait.  Then, make it so
that all points in the memory manager that handle making pages writable
use the helper function.  This should provide stable page write support
to most filesystems, while eliminating unnecessary waiting for devices
that don't require the feature.

Before this patchset, all filesystems would block, regardless of whether
or not it was necessary.  ext3 would wait, but still generate occasional
checksum errors.  The network filesystems were left to do their own
thing, so they'd wait too.

After this patchset, all the disk filesystems except ext3 and btrfs will
wait only if the hardware requires it.  ext3 (if necessary) snapshots
pages instead of blocking, and btrfs provides its own bdi so the mm will
never wait.  Network filesystems haven't been touched, so either they
provide their own stable page guarantees or they don't block at all.
The blocking behavior is back to what it was before 3.0 if you don't
have a disk requiring stable page writes.

Here's the result of using dbench to test latency on ext2:

3.8.0-rc3:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        109347     0.028    59.817
 ReadX         347180     0.004     3.391
 Flush          15514    29.828   287.283

Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        105556     0.029     4.273
 ReadX         335004     0.005     4.112
 Flush          14982    30.540   298.634

Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this
patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
similarly, but see the cover letter for those results.

Signed-off-by: Darrick J. Wong 
Acked-by: Steven Whitehouse 
Reviewed-by: Jan Kara 
Cc: Adrian Hunter 
Cc: Andy Lutomirski 
Cc: Artem Bityutskiy 
Cc: Joel Becker 
Cc: Mark Fasheh 
Cc: Jens Axboe 
Cc: Eric Van Hensbergen 
Cc: Ron Minnich 
Cc: Latchesar Ionkov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
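
The helper described in this commit can be sketched as follows: wait for
writeback only when the backing device says it needs stable pages.  All
struct and function names below are hypothetical user-space stand-ins,
and the "wait" is simulated with a counter rather than real blocking.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a backing device's capabilities. */
struct sketch_bdi {
	bool stable_pages_required;
};

/* Hypothetical stand-in for a page and its writeback state. */
struct sketch_page {
	const struct sketch_bdi *bdi;
	bool under_writeback;
	unsigned int waits;	/* how many times a caller had to block */
};

/* Stand-in for blocking until writeback of the page has finished. */
static void sketch_wait_on_writeback(struct sketch_page *page)
{
	if (page->under_writeback) {
		page->waits++;
		page->under_writeback = false;
	}
}

/* Wait only if the device behind the page requires stable pages. */
static void sketch_wait_for_stable_page(struct sketch_page *page)
{
	if (!page->bdi->stable_pages_required)
		return;
	sketch_wait_on_writeback(page);
}
```

Callers that are about to make a page writable would call the helper
unconditionally; the capability check keeps the common case free of
waiting.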

sched/rt: Move rt specific bits into new header file

2013-02-07T19:51:08+00:00

Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h

Signed-off-by: Clark Williams 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar

Negative (setpoint-dirty) in bdi_position_ratio()

2013-01-24T14:22:22+00:00

In bdi_position_ratio(), get the difference (setpoint-dirty) right even when
it is negative. Both setpoint and dirty are unsigned long, so the difference
was zero-padded to s64 where it should have been sign-extended. This issue
affects all 32-bit architectures; it does not affect 64-bit architectures,
where long and s64 have the same width.

In this function, dirty is between freerun and limit, and the pseudo-float
x lies in [-1,1] and is expected to be negative about half the time. With
zero-padding, instead of a small negative x we obtained a large positive
one, so bdi_position_ratio() returned garbage.

Casting the difference to s64 also prevents overflow with left-shift;
though normally these numbers are small and I never observed a 32-bit
overflow there.

(This patch does not solve the PAE OOM issue.)

Paul Szabo   psz@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

Reviewed-by: Jan Kara 
Reported-by: Paul Szabo 
Reference: http://bugs.debian.org/695182
Signed-off-by: Paul Szabo 
Signed-off-by: Fengguang Wu
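
The conversion problem described above can be reproduced in user space.
In this sketch uint32_t stands in for a 32-bit unsigned long and int64_t
for s64; diff_buggy() and diff_fixed() are hypothetical names chosen for
illustration, not functions from the kernel.

```c
#include <stdint.h>

/*
 * Buggy form: the subtraction happens in 32-bit unsigned arithmetic and
 * wraps around, and the wrapped value is then zero-extended to 64 bits.
 */
static int64_t diff_buggy(uint32_t setpoint, uint32_t dirty)
{
	uint32_t d = setpoint - dirty;

	return (int64_t)d;
}

/*
 * Fixed form: widen both operands to a signed 64-bit type first, so a
 * negative difference stays negative.
 */
static int64_t diff_fixed(uint32_t setpoint, uint32_t dirty)
{
	return (int64_t)setpoint - (int64_t)dirty;
}
```

For setpoint = 100 and dirty = 150 the buggy form yields 4294967246
(2^32 - 50), while the fixed form yields -50.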

writeback: add more tracepoints

2013-01-14T14:00:36+00:00

Add tracepoints for page dirtying, writeback_single_inode start, inode
dirtying and writeback.  For the latter two inode events, a pair of
events are defined to denote start and end of the operations (the
starting one has _start suffix and the one w/o suffix happens after
the operation is complete).  These inode ops are FS specific and can
be non-trivial, so having enclosing tracepoints is useful for external
tracers.

This is part of tracepoint additions to improve visibility into
dirtying / writeback operations for io tracer and userland.

v2: writeback_dirty_inode[_start] TPs may be called for files on
    pseudo FSes w/ unregistered bdi.  Check whether bdi->dev is %NULL
    before dereferencing.

v3: buffer dirtying moved to a block TP.

Signed-off-by: Tejun Heo 
Reviewed-by: Jan Kara 
Signed-off-by: Jens Axboe
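
The v2 note about unregistered bdis amounts to a NULL check before the
device name is read.  A minimal sketch, with hypothetical struct names and
an assumed "(unknown)" placeholder string for the missing-device case:

```c
#include <stddef.h>

/* Hypothetical stand-ins for a device and the bdi that may point to one. */
struct sketch_trace_device {
	const char *name;
};

struct sketch_trace_bdi {
	const struct sketch_trace_device *dev;	/* NULL if never registered */
};

/*
 * Return a name that is safe to record in a trace event, even for a bdi
 * on a pseudo filesystem that has no registered device.
 */
static const char *sketch_bdi_name(const struct sketch_trace_bdi *bdi)
{
	return bdi->dev ? bdi->dev->name : "(unknown)";
}
```

A tracepoint would copy the returned string into its event record instead
of dereferencing bdi->dev directly.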

mm: fix calculation of dirtyable memory

2012-12-21T01:40:18+00:00

The system uses global_dirtyable_memory() to calculate the number of
dirtyable pages, that is, pages that can be allocated to the page cache.  A
bug causes an underflow, which makes the page count look like a very large
unsigned number.  This in turn causes the dirty writeback throttling to
write back pages aggressively as they become dirty (usually 1 page at a
time).  This generally only affects systems with highmem because the
underflowed count gets subtracted from the global count of dirtyable
memory.

The problem was introduced with v3.2-4896-gab8fabd.

Fix is to ensure we don't get an underflowed total of either highmem or
global dirtyable memory.

Signed-off-by: Sonny Rao 
Signed-off-by: Puneet Kumar 
Acked-by: Johannes Weiner 
Tested-by: Damien Wyart 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds
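
One way to picture the underflow and a guard against it: when per-zone
reserves exceed the free plus reclaimable pages, an unsigned total wraps
to a huge value unless it is clamped.  The sketch below uses a
hypothetical struct sketch_zone and sums the two sides separately before
comparing; it illustrates the clamping idea and is not the form the patch
itself takes.

```c
#include <stddef.h>

/* Hypothetical per-zone page counts. */
struct sketch_zone {
	unsigned long free_pages;
	unsigned long reclaimable_pages;
	unsigned long reserve_pages;
};

/*
 * Total dirtyable pages across zones, clamped at zero.  Available and
 * reserved pages are accumulated separately so that no intermediate
 * unsigned subtraction can wrap.
 */
static unsigned long sketch_dirtyable(const struct sketch_zone *zones, size_t n)
{
	unsigned long avail = 0, reserve = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		avail += zones[i].free_pages + zones[i].reclaimable_pages;
		reserve += zones[i].reserve_pages;
	}
	return avail > reserve ? avail - reserve : 0;
}
```

A single zone with 10 free, 5 reclaimable and 20 reserved pages gives 0
here, where a plain unsigned subtraction would give a value near
ULONG_MAX.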

writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr()

2012-12-12T01:22:21+00:00

There is no reason to pass the nr_pages_dirtied argument, because the
nr_pages_dirtied value passed by the caller is unused in
balance_dirty_pages_ratelimited_nr().

Signed-off-by: Namjae Jeon 
Signed-off-by: Vivek Trivedi 
Cc: Wu Fengguang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds