linux.git/fs/fs-writeback.c, branch v4.13

writeback: rework wb_[dec|inc]_stat family of functions

2017-07-12T23:26:05+00:00

Currently the writeback statistics code uses a percpu counters to hold
various statistics.  Furthermore we have 2 families of functions - those
which disable local irq and those which doesn't and whose names begin
with double underscore.  However, they both end up calling
__add_wb_stats which in turn calls percpu_counter_add_batch which is
already irq-safe.

Exploiting this fact allows to eliminated the __wb_* functions since
they don't add any further protection than we already have.
Furthermore, refactor the wb_* function to call __add_wb_stat directly
without the irq-disabling dance.  This will likely result in better
runtime of code which deals with modifying the stat counters.

While at it also document why percpu_counter_add_batch is in fact
preempt and irq-safe since at least 3 people got confused.

Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov 
Acked-by: Tejun Heo 
Reviewed-by: Jan Kara 
Cc: Josef Bacik 
Cc: Mel Gorman 
Cc: Jeff Layton 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fs: add a blank lines on some kernel-doc comments

2017-05-16T11:44:10+00:00

Sphinx gets confused when it finds identation without a
good reason for it and without a preceding blank line:

	./fs/mpage.c:347: ERROR: Unexpected indentation.
	./fs/namei.c:4303: ERROR: Unexpected indentation.
	./fs/fs-writeback.c:2060: ERROR: Unexpected indentation.

No functional changes.

Signed-off-by: Mauro Carvalho Chehab

writeback: fix memory leak in wb_queue_work()

2017-03-13T14:27:34+00:00

When WB_registered flag is not set, wb_queue_work() skips queuing the
work, but does not perform the necessary clean up. In particular, if
work->auto_free is true, it should free the memory.

The leak condition can be reprouced by following these steps:

   mount /dev/sdb /mnt/sdb
   /* In qemu console: device_del sdb */
   umount /dev/sdb

Above will result in a wb_queue_work() call on an unregistered wb and
thus leak memory.

Reported-by: John Sperbeck 
Signed-off-by: Tahsin Erdogan 
Reviewed-by: Jan Kara 
Signed-off-by: Jens Axboe

fs/fs-writeback.c: remove redundant if check

2016-12-13T02:55:08+00:00

b_more_io non-empty check is already preceded by an opposite check.

Link: http://lkml.kernel.org/r/1478591249-30641-1-git-send-email-tahsin@google.com
Signed-off-by: Tahsin Erdogan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, writeback: flush plugged IO in wakeup_flusher_threads()

2016-08-10T01:58:06+00:00

I've found funny live-lock between raid10 barriers during resync and
memory controller hard limits. Inside mpage_readpages() task holds on to
its plug bio which blocks the barrier in raid10. Its memory cgroup have
no free memory thus the task goes into reclaimer but all reclaimable
pages are dirty and cannot be written because raid10 is rebuilding and
stuck on the barrier.

Common flush of such IO in schedule() never happens, because the caller
doesn't go to sleep.

Lock is 'live' because changing memory limit or killing tasks which
holds that stuck bio unblock whole progress.

That was what happened in 3.18.x but I see no difference in upstream
logic.  Theoretically this might happen even without memory cgroup.

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Jens Axboe

writeback: Write dirty times for WB_SYNC_ALL writeback

2016-08-04T20:19:16+00:00

Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
queue_io() so that inodes which have only dirty timestamps are properly
written on fsync(2) and sync(2). However there are other call sites -
most notably going through write_inode_now() - which expect inode to be
clean after WB_SYNC_ALL writeback. This is not currently true as we do
not clear I_DIRTY_TIME in __writeback_single_inode() even for
WB_SYNC_ALL writeback in all the cases. This then resulted in the
following oops because bdev_write_inode() did not clean the inode and
writeback code later stumbled over a dirty inode with detached wb.

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Modules linked in:
  CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-11:0)
  task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
  RIP: 0010:[]  []
  locked_inode_to_wb_and_lock_list+0xa2/0x750
  RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
  RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
  RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
  R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
  R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
  FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
  DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  Stack:
   ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
   ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
   ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
  Call Trace:
   [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
   [] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
   [] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
   [] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
   [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
   [] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
   [] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
   [] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
   [] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
   [] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
  Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
  00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
  80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
  RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
  RIP  [] locked_inode_to_wb_and_lock_list+0xa2/0x750
  fs/fs-writeback.c:281
   RSP 
  ---[ end trace 986a4d314dcb2694 ]---

Fix the problem by making sure __writeback_single_inode() writes inode
only with dirty times in WB_SYNC_ALL mode.

Reported-by: Dmitry Vyukov 
Tested-by: Laurent Dufour 
Signed-off-by: Jan Kara 
Signed-off-by: Jens Axboe

mm: move most file-based accounting to the node

2016-07-28T23:07:41+00:00

There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone.  This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.  Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
Cc: Hillf Danton 
Acked-by: Johannes Weiner 
Cc: Joonsoo Kim 
Cc: Minchan Kim 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fs/fs-writeback.c: inode writeback list tracking tracepoints

2016-07-26T23:19:19+00:00

The per-sb inode writeback list tracks inodes currently under writeback
to facilitate efficient sync processing.  In particular, it ensures that
sync only needs to walk through a list of inodes that were cleaned by
the sync.

Add a couple tracepoints to help identify when inodes are added/removed
to and from the writeback lists.  Piggyback off of the writeback
lazytime tracepoint template as it already tracks the relevant inode
information.

Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
Signed-off-by: Brian Foster 
Reviewed-by: Jan Kara 
Cc: Dave Chinner 
cc: Josef Bacik 
Cc: Holger Hoffstätte 
Cc: Al Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fs/fs-writeback.c: add a new writeback list for sync

2016-07-26T23:19:19+00:00

wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty one to wait on during sync.  This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for.  We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback.  wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback.  Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.

With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do.  For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.

Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner 
Signed-off-by: Josef Bacik 
Signed-off-by: Brian Foster 
Reviewed-by: Jan Kara 
Tested-by: Holger Hoffstätte 
Cc: Al Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

writeback: inode cgroup wb switch should not call ihold()

2016-06-30T19:58:41+00:00

Asynchronous wb switching of inodes takes an additional ref count on an
inode to make sure inode remains valid until switchover is completed.

However, anyone calling ihold() must already have a ref count on inode,
but in this case inode->i_count may already be zero:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 917 at fs/inode.c:397 ihold+0x2b/0x30
CPU: 1 PID: 917 Comm: kworker/u4:5 Not tainted 4.7.0-rc2+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Workqueue: writeback wb_workfn (flush-8:16)
 0000000000000000 ffff88007ca0fb58 ffffffff805990af 0000000000000000
 0000000000000000 ffff88007ca0fb98 ffffffff80268702 0000018d000004e2
 ffff88007cef40e8 ffff88007c9b89a8 ffff880079e3a740 0000000000000003
Call Trace:
 [] dump_stack+0x4d/0x6e
 [] __warn+0xc2/0xe0
 [] warn_slowpath_null+0x18/0x20
 [] ihold+0x2b/0x30
 [] inode_switch_wbs+0x11c/0x180
 [] wbc_detach_inode+0x170/0x1a0
 [] writeback_sb_inodes+0x21c/0x530
 [] wb_writeback+0xee/0x1e0
 [] wb_workfn+0xd7/0x280
 [] ? try_to_wake_up+0x1b1/0x2b0
 [] process_one_work+0x129/0x300
 [] worker_thread+0x126/0x480
 [] ? __schedule+0x1c7/0x561
 [] ? process_one_work+0x300/0x300
 [] kthread+0xc4/0xe0
 [] ? kfree+0xc8/0x100
 [] ret_from_fork+0x1f/0x40
 [] ? __kthread_parkme+0x70/0x70
---[ end trace aaefd2fd9f306bc4 ]---

Signed-off-by: Tahsin Erdogan 
Acked-by: Tejun Heo 
Reviewed-by: Jan Kara 
Signed-off-by: Jens Axboe