linux.git/fs/fs-writeback.c, branch v2.6.24

Revert "writeback: introduce writeback_control.more_io to indicate more io"

2008-01-15T05:21:29+00:00

This reverts commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, as
requested by Fengguang Wu.  It's not quite fully baked yet, and while
there are patches around to fix the problems it caused, they should get
more testing.  Says Fengguang: "I'll resend them both for -mm later on,
in a more complete patchset".

See

	http://bugzilla.kernel.org/show_bug.cgi?id=9738

for some of this discussion.

Requested-by: Fengguang Wu 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Signed-off-by: Linus Torvalds

Use helpers to obtain task pid in printks

2007-10-19T18:53:43+00:00

The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov 
Cc: Dave Airlie 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

introduce I_SYNC

2007-10-17T15:43:02+00:00

I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect.  One of the purposes
now uses the new I_SYNC bit.

Also document the various bits and change their order from historical to
logical.

[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel 
Cc: Dave Kleikamp 
Cc: David Chinner 
Cc: Anton Altaparmakov 
Cc: Al Viro 
Cc: Christoph Hellwig 
Signed-off-by: Adrian Bunk 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: introduce writeback_control.more_io to indicate more io

2007-10-17T15:43:02+00:00

After making dirty a 100M file, the normal behavior is to start the writeback
for all data after 30s delays.  But sometimes the following happens instead:

	- after 30s:    ~4M
	- after 5s:     ~4M
	- after 5s:     all remaining 92M

Some analyze shows that the internal io dispatch queues goes like this:

		s_io            s_more_io
		-------------------------
	1)	100M,1K         0
	2)	1K              96M
	3)	0               96M

1) initial state with a 100M file and a 1K file
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes(BUG)

nr_to_write > 0 in (3) fools the upper layer to think that data have all been
written out.  The big dirty file is actually still sitting in s_more_io.  We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty.  It is also not an option to draw inodes from both
s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
and might also starve other superblocks in sync time(well kupdate may still
starve some superblocks, that's another bug).

We have to return when a full scan of s_io completes.  So nr_to_write > 0 does
not necessarily mean that "all data are written".  This patch introduces a
flag writeback_control.more_io to indicate this situation.  With it the big
dirty file no longer has to wait for the next kupdate invocation 5s later.

Cc: David Chinner 
Cc: Ken Chen 
Signed-off-by: Fengguang Wu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: fix ntfs with sb_has_dirty_inodes()

2007-10-17T15:43:02+00:00

NTFS's if-condition on dirty inodes is not complete.  Fix it with
sb_has_dirty_inodes().

Cc: Anton Altaparmakov 
Cc: Ken Chen 
Signed-off-by: Fengguang Wu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: fix time ordering of the per superblock inode lists 8

2007-10-17T15:43:02+00:00

Streamline the management of dirty inode lists and fix time ordering bugs.

The writeback logic used to move not-yet-expired dirty inodes from s_dirty to
s_io, *only to* move them back.  The move-inodes-back-and-forth thing is a
mess, which is eliminated by this patch.

The new scheme is:
- s_dirty acts as a time ordered io delaying queue;
- s_io/s_more_io together acts as an io dispatching queue.

On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
every full scan of s_io.  Otherwise  (i.e. for sync/throttle/background
writeback), we always pull from s_dirty on each run (a partial scan).

Note that the line
	list_splice_init(&sb->s_more_io, &sb->s_io);
is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
sit in s_io for a long time, preventing new expired inodes to get in.

Cc: Ken Chen 
Cc: Andrew Morton 
Signed-off-by: Fengguang Wu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: fix periodic superblock dirty inode flushing

2007-10-17T15:43:02+00:00

Current -mm tree has bucketful of bug fixes in periodic writeback path.
However, we still hit a glitch where dirty pages on a given inode aren't
completely flushed to the disk, and system will accumulate large amount of
dirty pages beyond what dirty_expire_interval is designed for.

The problem is __sync_single_inode() will move an inode to sb->s_dirty list
even when there are more pending dirty pages on that inode.  If there is
another inode with a small number of dirty pages, we hit a case where the loop
iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
Thus leaving the inode that has large amount of dirty pages behind and it has
to wait for another dirty_writeback_interval before we flush it again.  We
effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
If the rate of dirtying is sufficiently high, the system will start
accumulate a large number of dirty pages.

So fix it by having another sb->s_more_io list on which to park the inode
while we iterate through sb->s_io and to allow each dirty inode which resides
on that sb to have an equal chance of flushing some amount of dirty pages.

Signed-off-by: Ken Chen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: fix time ordering of the per superblock dirty inode lists 7

2007-10-17T15:43:02+00:00

This one fixes four bugs.

There are a few situation in there where writeback decides it is going to skip
over a blockdev inode on the kernel-internal blockdev superblock.  It
presently does this by moving the blockdev inode onto the tail of the blockdev
superblock's s_dirty.  But

a) this screws up s_dirty's reverse-time-orderedness and

b) refiling the blockdev for writeback in another 30 second is rude.  We
   should try again sooner than that.

Fix all this up by using redirty_head(): move the blockdev inode onto the head
of the blockdev superblock's s_dirty list for prompt writeback.

Cc: Mike Waychison 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: fix time ordering of the per superblock dirty inode lists 6

2007-10-17T15:43:02+00:00

Recycling the previous changelog:

  When the writeback function is operating in writeback-for-flushing mode
  (as opposed to writeback-for-integrity) and it encounters an I_LOCKed inode,
  it will skip writing that inode.  This is done for throughput and latency:
  move on to another inode rather than blocking for this one.

  Writeback skips this inode by moving it off s_io and onto s_dirty, so that
  writeback can proceed with the other inodes on s_io.

  However that inode movement can corrupt s_dirty's
  reverse-time-orderedness.  Fix that by using the new redirty_tail(), which
  will update the refiled inode's dirtied_when field.

  Note: the behaviour in here is a bit rude: if kupdate happens to come
  across a locked inode then it will defer writeback of that inode for another
  30 seconds.  We'll address that in the next patch.

Address that here.  What we do is to move the skipped inode to the _head_ of
s_dirty, immediately eligible for writeout again.  Instead of deferring that
writeout for another 30 seconds.

One would think that this might cause a livelock: we keep on trying to write
the same locked inode.  But it won't because:

a) if that was the case, it would _already_ be happening on the
   balance_dirty_pages codepath.  Because balance_dirty_pages() doesn't care
   about inode timestamps.

b) if we skipped this inode then we won't have done any writeback.  The
   higher-level writeback paths will see that wbc.nr_to_write didn't change
   and they'll then back off and take a nap.

Cc: Mike Waychison 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: fix time ordering of the per superblock dirty inode lists 5

2007-10-17T15:43:02+00:00

When the writeback function is operating in writeback-for-flushing mode (as
opposed to writeback-for-integrity) and it encounters an I_LOCKed inode, it
will skip writing that inode.  This is done for throughput and latency: move
on to another inode rather than blocking for this one.

Writeback skips this inode by moving it off s_io and onto s_dirty, so that
writeback can proceed with the other inodes on s_io.

However that inode movement can corrupt s_dirty's reverse-time-orderedness.
Fix that by using the new redirty_tail(), which will update the refiled
inode's dirtied_when field.

Note: the behaviour in here is a bit rude: if kupdate happens to come across a
locked inode then it will defer writeback of that inode for another 30
seconds.  We'll address that in the next patch.

Cc: Mike Waychison 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds