linux.git/fs/fscache, branch v3.11

FS-Cache: Don't use spin_is_locked() in assertions

2013-06-19T13:16:47+00:00

Under certain circumstances, spin_is_locked() is hardwired to 0 - even when the
code would normally be in a locked section where it should return 1.  This
means it cannot be used for an assertion that checks that a spinlock is locked.
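
As a rough userspace illustration (not kernel code), the sketch below models a
lock implementation that keeps no state at all, which I believe is what some
uniprocessor configurations without spinlock debugging amount to.  The
up_spin_*() names are hypothetical stand-ins invented for this sketch.

```c
#include <assert.h>

/* Hypothetical stateless lock, standing in for a build where locking
 * compiles away to nothing. */
struct up_spinlock { char unused; };

#define up_spin_lock(l)      ((void)(l))
#define up_spin_unlock(l)    ((void)(l))
#define up_spin_is_locked(l) ((void)(l), 0)	/* always reports "unlocked" */

/* Returns what an "is it locked?" assertion would observe while the lock is
 * nominally held. */
static int is_locked_inside_section(void)
{
	struct up_spinlock lock = { 0 };
	int seen;

	up_spin_lock(&lock);
	seen = up_spin_is_locked(&lock);	/* 0, even though we "hold" it */
	up_spin_unlock(&lock);
	return seen;
}
```

An assertion of the form "the lock must be held here" would therefore fire on
such a build even though the locking is perfectly correct.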

Remove such usages from FS-Cache.

The following oops might otherwise be observed:

FS-Cache: Assertion failed
BUG: failure at fs/fscache/operation.c:270/fscache_start_operations()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 10 Comm: kworker/u2:1 Not tainted 3.10.0-rc1-00133-ge7ebb75 #2
Workqueue: fscache_operation fscache_op_work_func [fscache]
7f091c48 603c8947 7f090000 7f9b1361 7f25f080 00000001 7f26d440 7f091c90
60299eb8 7f091d90 602951c5 7f26d440 3000000008 7f091da0 7f091cc0 7f091cd0
00000007 00000007 00000006 7f091ae0 00000010 0000010e 7f9af330 7f091ae0
Call Trace:
7f091c88: [<60299eb8>] dump_stack+0x17/0x19
7f091c98: [<602951c5>] panic+0xf4/0x1e9
7f091d38: [<6002b10e>] set_signals+0x1e/0x40
7f091d58: [<6005b89e>] __wake_up+0x4e/0x70
7f091d98: [<7f9aa003>] fscache_start_operations+0x43/0x50 [fscache]
7f091da8: [<7f9aa1e3>] fscache_op_complete+0x1d3/0x220 [fscache]
7f091db8: [<60082985>] unlock_page+0x55/0x60
7f091de8: [<7fb25bb0>] cachefiles_read_copier+0x250/0x330 [cachefiles]
7f091e58: [<7f9ab03c>] fscache_op_work_func+0xac/0x120 [fscache]
7f091e88: [<6004d5b0>] process_one_work+0x250/0x3a0
7f091ef8: [<6004edc7>] worker_thread+0x177/0x2a0
7f091f38: [<6004ec50>] worker_thread+0x0/0x2a0
7f091f58: [<60054418>] kthread+0xd8/0xe0
7f091f68: [<6005bb27>] finish_task_switch.isra.64+0x37/0xa0
7f091fd8: [<600185cf>] new_thread_handler+0x8f/0xb0

Reported-by: Milosz Tanski 
Signed-off-by: David Howells 
Reviewed-and-tested-By: Milosz Tanski

FS-Cache: The retrieval remaining-pages counter needs to be atomic_t

2013-06-19T13:16:47+00:00

struct fscache_retrieval contains a count of the number of pages that still
need some processing (n_pages).  This is decremented as the pages are
processed.

However, this needs to be atomic as, just occasionally (I think),
fscache_retrieval_complete() may be called from cachefiles_read_backing_file()
and cachefiles_read_copier() simultaneously.

This happens when an fscache_read_or_alloc_pages() request containing a lot of
pages (say a couple of hundred) is being processed.  The read on each backing
page is dispatched individually because we need to insert a monitor into the
waitqueue to catch when the read completes.  However, under low-memory
conditions, we might be forced to wait in the allocator - and this gives the
I/O on the backing page a chance to complete first.

When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto
the workqueue without waiting for the operation to finish the initial I/O
dispatch (we want to release any pages we can as soon as we can), thus both can
end up running simultaneously and potentially attempting to partially complete
the retrieval simultaneously (ENOMEM may occur, backing pages may already be in
the page cache).

This was demonstrated by parallelling the non-atomic counter with an atomic
counter and printing both of them when the assertion fails.  At this point, the
atomic counter has reached zero, but the non-atomic counter has not.
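
The property that matters can be sketched in userspace with C11 atomics.  This
is only an illustration: struct sk_retrieval and the sk_*() helpers are
hypothetical names, not the FS-Cache API.  The subtraction is a single
read-modify-write, so two callers cannot both read the same old value and lose
one of the updates, which is what a plain "n_pages -= n" permits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the retrieval operation. */
struct sk_retrieval {
	atomic_int n_pages;	/* pages still needing processing */
};

static void sk_retrieval_init(struct sk_retrieval *op, int n_pages)
{
	atomic_init(&op->n_pages, n_pages);
}

/* Account for n processed pages.  Returns true only for the caller whose
 * subtraction takes the count to zero, so completion is signalled once. */
static bool sk_retrieval_complete(struct sk_retrieval *op, int n)
{
	int before = atomic_fetch_sub(&op->n_pages, n);

	assert(before >= n);	/* never complete more pages than remain */
	return before == n;
}
```

With five pages outstanding, completing two and then three reports completion
only on the second call, whichever thread makes it.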

To fix this, make the counter an atomic_t.

Without the fix, the race results in the following bug appearing:

	FS-Cache: Assertion failed
	3 == 5 is false
	------------[ cut here ]------------
	kernel BUG at fs/fscache/operation.c:421!

or

	FS-Cache: Assertion failed
	3 == 5 is false
	------------[ cut here ]------------
	kernel BUG at fs/fscache/operation.c:414!

With a backtrace like the following:

RIP: 0010:[] fscache_put_operation+0x1ad/0x240 [fscache]
Call Trace:
 [] fscache_retrieval_work+0x55/0x270 [fscache]
 [] ? fscache_retrieval_work+0x0/0x270 [fscache]
 [] worker_thread+0x170/0x2a0
 [] ? autoremove_wake_function+0x0/0x40
 [] ? worker_thread+0x0/0x2a0
 [] kthread+0x96/0xa0
 [] child_rip+0xa/0x20
 [] ? kthread+0x0/0xa0
 [] ? child_rip+0x0/0x20

Signed-off-by: David Howells 
Reviewed-and-tested-By: Milosz Tanski 
Acked-by: Jeff Layton

FS-Cache: Simplify cookie retention for fscache_objects, fixing oops

2013-06-19T13:16:47+00:00

Simplify the way fscache cache objects retain their cookie.  The way I
implemented the cookie storage handling made synchronisation a pain (ie. the
object state machine can't rely on the cookie actually still being there).

Instead of the object being detached from the cookie and the cookie being
freed in __fscache_relinquish_cookie(), we defer both operations:

 (*) The detachment of the object from the list in the cookie now takes place
     in fscache_drop_object() and is thus governed by the object state machine
     (fscache_detach_from_cookie() has been removed).

 (*) The release of the cookie is now in fscache_object_destroy() - which is
     called by the cache backend just before it frees the object.

This means that the fscache_cookie struct is now available to the cache all the
way through from ->alloc_object() to ->drop_object() and ->put_object() -
meaning that it's no longer necessary to take object->lock to guarantee access.

However, __fscache_relinquish_cookie() doesn't wait for the object to go all
the way through to destruction before letting the netfs proceed.  That would
massively slow down the netfs.  Since __fscache_relinquish_cookie() leaves the
cookie around, it must therefore break all attachments to the netfs - which
includes ->def, ->netfs_data and any outstanding page read/writes.

To handle this, struct fscache_cookie now has an n_active counter:

 (1) This starts off initialised to 1.

 (2) Any time the cache needs to get at the netfs data, it calls
     fscache_use_cookie() to increment it - if it is not zero.  If it was zero,
     then access is not permitted.

 (3) When the cache has finished with the data, it calls fscache_unuse_cookie()
     to decrement it.  This does a wake-up on it if it reaches 0.

 (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
     reach 0.  The initialisation to 1 in step (1) ensures that we only get
     wake ups when we're trying to get rid of the cookie.

This leaves __fscache_relinquish_cookie() a lot simpler.
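
The counting rules in steps (1) to (4) can be sketched in userspace with C11
atomics.  This is an illustration under assumptions, not the kernel code: the
sk_cookie type and sk_*() functions are hypothetical, and the wake-up and the
sleeping wait are reduced to boolean return values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cookie's usage counter. */
struct sk_cookie {
	atomic_int n_active;
};

/* (1) The counter starts off at 1. */
static void sk_cookie_init(struct sk_cookie *c)
{
	atomic_init(&c->n_active, 1);
}

/* (2) Increment unless the counter is zero; false means no access. */
static bool sk_use_cookie(struct sk_cookie *c)
{
	int v = atomic_load(&c->n_active);

	while (v != 0) {
		/* On failure v is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->n_active, &v, v + 1))
			return true;
	}
	return false;
}

/* (3) Decrement; true means the count hit 0 and a waiter should be woken. */
static bool sk_unuse_cookie(struct sk_cookie *c)
{
	return atomic_fetch_sub(&c->n_active, 1) == 1;
}

/* (4) Drop the initial count.  True means there is nothing to wait for;
 * false means a real implementation would now sleep until woken. */
static bool sk_relinquish_start(struct sk_cookie *c)
{
	return sk_unuse_cookie(c);
}
```

Because of the initial count of 1, sk_unuse_cookie() can only report a
transition to zero after relinquishment has begun.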


***
This fixes a problem in the current code whereby if fscache_invalidate() is
followed sufficiently quickly by fscache_relinquish_cookie() then it is
possible for __fscache_relinquish_cookie() to have detached the cookie from the
object and cleared the pointer before a thread is dispatched to process the
invalidation state in the object state machine.

Since the pending write clearance was deferred to the invalidation state to
make it asynchronous, we need to either wait in relinquishment for the stores
tree to be cleared in the invalidation state or we need to handle the clearance
in relinquishment.

Further, if the relinquishment code does clear the tree, then the invalidation
state needs to make the clearance contingent on still having the cookie to hand
(since that's where the tree is rooted) and we have to prevent the cookie from
disappearing for the duration.

This can lead to an oops like the following:

BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
...
RIP: 0010:[] _spin_lock+0xe/0x30
...
CR2: 000000000000000c ...
...
Process kslowd002 (...)
....
Call Trace:
 [] fscache_invalidate_writes+0x38/0xd0 [fscache]
 [] ? __switch_to+0xd0/0x320
 [] ? find_busiest_queue+0x69/0x150
 [] ? slow_work_enqueue+0x104/0x180
 [] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
 [] ? bit_waitqueue+0x17/0xd0
 [] slow_work_execute+0x233/0x310
 [] slow_work_thread+0x205/0x360
 [] ? autoremove_wake_function+0x0/0x40
 [] ? slow_work_thread+0x0/0x360
 [] kthread+0x96/0xa0
 [] child_rip+0xa/0x20
 [] ? kthread+0x0/0xa0
 [] ? child_rip+0x0/0x20

The parameter to fscache_invalidate_writes() was object->cookie which is NULL.

Signed-off-by: David Howells 
Tested-By: Milosz Tanski 
Acked-by: Jeff Layton

FS-Cache: Fix object state machine to have separate work and wait states

2013-06-19T13:16:47+00:00

Fix object state machine to have separate work and wait states as that makes
it easier to envision.

There are now three kinds of state:

 (1) Work state.  This is an execution state.  No event processing is performed
     by a work state.  The function attached to a work state returns a pointer
     indicating the next state to which the OSM should transition.  Returning
     NO_TRANSIT repeats the current state, but goes back to the scheduler
     first.

 (2) Wait state.  This is an event processing state.  No execution is
     performed by a wait state.  Wait states are just tables of "if event X
     occurs, clear it and transition to state Y".  The dispatcher returns to
     the scheduler if none of the events in which the wait state has an
     interest are currently pending.

 (3) Out-of-band state.  This is a special work state.  Transitions to normal
     states can be overridden when an unexpected event occurs (eg. I/O error).
     Instead the dispatcher disables and clears the OOB event and transits to
     the specified work state.  This then acts as an ordinary work state,
     though object->state points to the overridden destination.  Returning
     NO_TRANSIT resumes the overridden transition.

In addition, the states have names in their definitions, so there's no need for
tables of state names.  Further, the EV_REQUEUE event is no longer necessary as
that is automatic for work states.
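
A much-reduced userspace sketch of the work/wait split follows.  The struct
layouts, state names, events and dispatcher here are assumptions made for
illustration (out-of-band handling is left out entirely); they are not the
definitions in fs/fscache/object.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sk_object;
struct sk_state;

struct sk_transition {
	unsigned long events;			/* events that trigger this */
	const struct sk_state *transit_to;	/* destination state */
};

struct sk_state {
	const char *name;			/* name lives in the definition */
	const struct sk_state *(*work)(struct sk_object *obj); /* NULL: wait state */
	struct sk_transition transitions[3];	/* ends at events == 0 */
};

struct sk_object {
	const struct sk_state *state;
	unsigned long events;			/* pending events */
	int work_runs;
};

#define SK_NO_TRANSIT	((const struct sk_state *)NULL)
#define SK_EV_KICK	0x1UL
#define SK_EV_KILL	0x2UL

static const struct sk_state sk_wait, sk_dead;	/* defined below */

static const struct sk_state *sk_do_work(struct sk_object *obj)
{
	obj->work_runs++;
	return &sk_wait;
}

static const struct sk_state *sk_do_kill(struct sk_object *obj)
{
	(void)obj;
	return &sk_dead;
}

static const struct sk_state sk_init = { .name = "INIT", .work = sk_do_work };
static const struct sk_state sk_kill = { .name = "KILL", .work = sk_do_kill };
static const struct sk_state sk_wait = {
	.name = "WAIT",
	.transitions = { { SK_EV_KICK, &sk_init }, { SK_EV_KILL, &sk_kill } },
};
static const struct sk_state sk_dead = { .name = "DEAD" };

/* Run states until the object would go back to the scheduler. */
static const struct sk_state *sk_dispatch(struct sk_object *obj)
{
	for (;;) {
		const struct sk_state *st = obj->state;
		const struct sk_transition *t;

		if (st->work) {			/* work state: execute only */
			const struct sk_state *next = st->work(obj);

			if (next == SK_NO_TRANSIT)
				return st;	/* repeat after rescheduling */
			obj->state = next;
			continue;
		}
		/* wait state: table lookup only */
		for (t = st->transitions; t->events; t++) {
			if (obj->events & t->events) {
				obj->events &= ~t->events;
				obj->state = t->transit_to;
				break;
			}
		}
		if (!t->events)
			return obj->state;	/* nothing of interest pending */
	}
}
```

Here INIT and KILL are work states with a function and no table, while WAIT
and DEAD are wait states with a table and no function.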

Since the states are now separate structs rather than values in an enum, it's
not possible to use comparisons other than (non-)equality between them, so use
some object->flags to indicate what phase an object is in.

The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
(EV_KILL).  An object flag now carries the information about retirement.

Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
into a KILL_OBJECT state and additional states have been added for handling
waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).

A state has also been added for synchronising with parent object initialisation
(WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).

Signed-off-by: David Howells 
Tested-By: Milosz Tanski 
Acked-by: Jeff Layton

FS-Cache: Wrap checks on object state

2013-06-19T13:16:47+00:00

Wrap checks on object state (mostly outside of fs/fscache/object.c) with
inline functions so that the mechanism can be replaced.

Some of the state checks within object.c are left as-is as they will be
replaced.
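
The idea can be sketched as below.  The enum, struct and helper names are
hypothetical and chosen for this illustration only; the point is that callers
ask a question through an inline function rather than comparing state values
themselves, so the representation behind the function can change later.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical object representation; callers should not look inside. */
enum sk_obj_state { SK_OBJ_INIT, SK_OBJ_ACTIVE, SK_OBJ_DYING, SK_OBJ_DEAD };

struct sk_obj {
	enum sk_obj_state state;
};

/* Wrappers: the only places that know how the state is encoded. */
static inline bool sk_obj_is_active(const struct sk_obj *obj)
{
	return obj->state == SK_OBJ_ACTIVE;
}

static inline bool sk_obj_is_dying(const struct sk_obj *obj)
{
	return obj->state >= SK_OBJ_DYING;
}
```

If the enum is later replaced by, say, flags or state pointers, only the two
wrapper bodies need to change.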

Signed-off-by: David Howells 
Tested-By: Milosz Tanski 
Acked-by: Jeff Layton

FS-Cache: Uninline fscache_object_init()

2013-06-19T13:16:47+00:00

Uninline fscache_object_init() so as not to expose some of the FS-Cache
internals to the cache backend.

Signed-off-by: David Howells 
Tested-By: Milosz Tanski 
Acked-by: Jeff Layton

FS-Cache: Don't sleep in page release if __GFP_FS is not set

2013-06-19T13:16:47+00:00

Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set.  This
goes some way towards mitigating fscache deadlocking against ext4 by way of
the allocator, eg:

INFO: task flush-8:0:24427 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0       D ffff88003e2b9fd8     0 24427      2 0x00000000
 ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8
 0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040
 0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5
Call Trace:
 [] ? __lock_is_held+0x31/0x53
 [] ? radix_tree_lookup_element+0xf4/0x12a
 [] schedule+0x60/0x62
 [] __fscache_wait_on_page_write+0x8b/0xa5 [fscache]
 [] ? __init_waitqueue_head+0x4d/0x4d
 [] __fscache_maybe_release_page+0x30c/0x324 [fscache]
 [] ? __fscache_maybe_release_page+0x6c/0x324 [fscache]
 [] ? trace_hardirqs_on_caller+0x114/0x170
 [] nfs_fscache_release_page+0x68/0x94 [nfs]
 [] nfs_release_page+0x7e/0x86 [nfs]
 [] try_to_release_page+0x32/0x3b
 [] shrink_page_list+0x535/0x71a
 [] ? trace_hardirqs_on_caller+0x114/0x170
 [] shrink_inactive_list+0x20a/0x2dd
 [] ? mark_held_locks+0xbe/0xea
 [] shrink_lruvec+0x34c/0x3eb
 [] do_try_to_free_pages+0xcf/0x355
 [] try_to_free_pages+0x9a/0xa1
 [] __alloc_pages_nodemask+0x494/0x6f7
 [] kmem_getpages+0x58/0x155
 [] fallback_alloc+0x120/0x1f3
 [] ? trace_hardirqs_off+0xd/0xf
 [] ____cache_alloc_node+0x177/0x186
 [] ? ext4_init_io_end+0x1c/0x37
 [] kmem_cache_alloc+0xf1/0x176
 [] ? test_set_page_writeback+0x101/0x113
 [] ext4_init_io_end+0x1c/0x37
 [] ext4_bio_write_page+0x20f/0x3af
 [] mpage_da_submit_io+0x26e/0x2f6
 [] ? __find_get_block_slow+0x38/0x133
 [] mpage_da_map_and_submit+0x3a7/0x3bd
 [] ext4_da_writepages+0x30d/0x426
 [] do_writepages+0x1c/0x2a
 [] __writeback_single_inode+0x3e/0xe5
 [] writeback_sb_inodes+0x1bd/0x2f4
 [] __writeback_inodes_wb+0x6f/0xb4
 [] wb_writeback+0x101/0x195
 [] ? trace_hardirqs_on_caller+0x114/0x170
 [] ? wb_do_writeback+0xaa/0x173
 [] wb_do_writeback+0x4a/0x173
 [] ? trace_hardirqs_on+0xd/0xf
 [] ? del_timer+0x4b/0x5b
 [] bdi_writeback_thread+0x6d/0x147
 [] ? wb_do_writeback+0x173/0x173
 [] kthread+0xd0/0xd8
 [] ? _raw_spin_unlock_irq+0x29/0x3e
 [] ? __init_kthread_worker+0x55/0x55
 [] ret_from_fork+0x7c/0xb0
 [] ? __init_kthread_worker+0x55/0x55
2 locks held by flush-8:0/24427:
 #0:  (&type->s_umount_key#41){.+.+..}, at: [] grab_super_passive+0x4c/0x76
 #1:  (jbd2_handle){+.+...}, at: [] start_this_handle+0x475/0x4ea


The problem here is that another thread, which is attempting to write the
to-be-stored NFS page to the on-ext4 cache file, is waiting for the journal
lock, eg:

INFO: task kworker/u:2:24437 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u:2     D ffff880039589768     0 24437      2 0x00000000
 ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8
 0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040
 0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13
Call Trace:
 [] ? mark_held_locks+0xbe/0xea
 [] ? _raw_spin_unlock_irqrestore+0x3a/0x50
 [] ? trace_hardirqs_on_caller+0x114/0x170
 [] ? trace_hardirqs_on+0xd/0xf
 [] schedule+0x60/0x62
 [] start_this_handle+0x317/0x4ea
 [] ? __init_waitqueue_head+0x4d/0x4d
 [] jbd2__journal_start+0xb3/0x12e
 [] __ext4_journal_start_sb+0xb2/0xc6
 [] ext4_da_write_begin+0x109/0x233
 [] generic_file_buffered_write+0x11a/0x264
 [] ? __mark_inode_dirty+0x2d/0x1ee
 [] __generic_file_aio_write+0x2a5/0x2d5
 [] generic_file_aio_write+0x6f/0xd0
 [] ext4_file_write+0x38c/0x3c4
 [] do_sync_write+0x91/0xd1
 [] cachefiles_write_page+0x26f/0x310 [cachefiles]
 [] fscache_write_op+0x21e/0x37a [fscache]
 [] ? _raw_spin_unlock_irq+0x29/0x3e
 [] fscache_op_work_func+0x78/0xd7 [fscache]
 [] process_one_work+0x232/0x3a8
 [] ? process_one_work+0x1d7/0x3a8
 [] worker_thread+0x214/0x303
 [] ? manage_workers+0x245/0x245
 [] kthread+0xd0/0xd8
 [] ? _raw_spin_unlock_irq+0x29/0x3e
 [] ? __init_kthread_worker+0x55/0x55
 [] ret_from_fork+0x7c/0xb0
 [] ? __init_kthread_worker+0x55/0x55
4 locks held by kworker/u:2/24437:
 #0:  (fscache_operation){.+.+.+}, at: [] process_one_work+0x1d7/0x3a8
 #1:  ((&op->work)){+.+.+.}, at: [] process_one_work+0x1d7/0x3a8
 #2:  (sb_writers#14){.+.+.+}, at: [] generic_file_aio_write+0x51/0xd0
 #3:  (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [] generic_file_aio_write+0x5b/0x

fscache already tries to cancel pending stores, but it can't cancel a write
for which I/O is already in progress.

An alternative would be to accept writing garbage to the cache under extreme
circumstances and to kill the afflicted cache object if we have to do this.
However, we really need to know how strapped the allocator is before deciding
to do that.
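
The shape of the check can be sketched in userspace as follows.  Everything
here is an assumption for illustration: SK_GFP_FS is a made-up bit rather than
the kernel's __GFP_FS value, and the wait itself is reduced to a flag so that
the sketch never blocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit meaning "the caller may re-enter filesystem code". */
#define SK_GFP_FS 0x2u

struct sk_page {
	bool being_written;	/* a store to the cache is in progress */
};

/* Returns true if the page may be released.  *would_sleep records whether
 * real code would have had to wait for the write to finish first. */
static bool sk_maybe_release_page(const struct sk_page *page,
				  unsigned int gfp, bool *would_sleep)
{
	*would_sleep = false;

	if (!page->being_written)
		return true;		/* nothing to wait for */

	if (!(gfp & SK_GFP_FS))
		return false;		/* report busy rather than block */

	*would_sleep = true;		/* waiting is permitted in this context */
	return true;
}
```

A caller that arrives by way of the allocator without the filesystem flag thus
gets an immediate "busy" answer instead of sleeping on a write that may itself
be stuck behind a lock the caller holds.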

Signed-off-by: David Howells 
Tested-By: Milosz Tanski 
Acked-by: Jeff Layton

fs/fscache: remove spin_lock() from the condition in while()

2013-06-19T13:16:47+00:00

The spin_lock() call within the condition of the while() loop will cause a
compile error if spin_lock() is not a function.  This is not a problem on
mainline, but it does not look pretty and there is no reason to do it that way.
This patch writes it a little differently and avoids the double condition.
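
The restructuring can be sketched in userspace like this.  The lock macros,
struct and lookup helper are hypothetical; the macros are deliberately written
as statements so that they could not be placed inside a comma expression in a
loop condition.

```c
#include <assert.h>

/* Hypothetical store with a fake lock and a count of pending items. */
struct sk_store {
	int locked;
	int pending;
};

/* Statement-like macros: "while (sk_lock(s), n = ..., n > 0)" would not
 * compile with these. */
#define sk_lock(s)	do { assert(!(s)->locked); (s)->locked = 1; } while (0)
#define sk_unlock(s)	do { assert((s)->locked); (s)->locked = 0; } while (0)

/* Take up to max pending items; the lock must be held. */
static int sk_gang_lookup(struct sk_store *s, int max)
{
	int n = s->pending < max ? s->pending : max;

	assert(s->locked);
	s->pending -= n;
	return n;
}

/* Lock, look up and unlock as separate statements in the loop body, with a
 * single exit test. */
static int sk_drain(struct sk_store *s)
{
	int total = 0;

	for (;;) {
		int n;

		sk_lock(s);
		n = sk_gang_lookup(s, 16);
		sk_unlock(s);

		if (n == 0)
			break;
		total += n;
	}
	return total;
}
```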

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: David Howells 
Tested-By: Milosz Tanski 
Acked-by: Jeff Layton

fs/fscache/stats.c: fix memory leak

2013-04-29T22:54:27+00:00

There is a kernel memory leak observed when the proc file
/proc/fs/fscache/stats is read.

The reason is that in fscache_stats_open, single_open is called and the
respective release function is not called during release.  Hence fix
with correct release function - single_release().
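
The pairing can be illustrated with a userspace analogue.  The sk_*()
functions below are hypothetical stand-ins that merely mimic the pattern (open
allocates per-file state, release must free it); they are not the seq_file
implementation.

```c
#include <assert.h>
#include <stdlib.h>

static int sk_live_allocations;	/* counts state not yet released */

struct sk_file {
	void *private_data;
};

/* Analogue of an open helper that allocates per-open state. */
static int sk_single_open(struct sk_file *f)
{
	f->private_data = malloc(32);
	if (!f->private_data)
		return -1;
	sk_live_allocations++;
	return 0;
}

/* Analogue of the matching release helper that frees that state. */
static int sk_single_release(struct sk_file *f)
{
	free(f->private_data);
	f->private_data = NULL;
	sk_live_allocations--;
	return 0;
}

struct sk_file_ops {
	int (*open)(struct sk_file *f);
	int (*release)(struct sk_file *f);
};

/* The release hook must be the counterpart of whatever open used. */
static const struct sk_file_ops sk_stats_ops = {
	.open = sk_single_open,
	.release = sk_single_release,
};
```

If .release pointed at a helper that did not free private_data, each open and
close of the file would leave one allocation behind.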

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=57101

Signed-off-by: Anurup m 
Cc: shyju pv 
Cc: Sanil kumar 
Cc: Nataraj m 
Cc: Li Zefan 
Cc: David Howells 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

hlist: drop the node parameter from iterators

2013-02-28T03:10:24+00:00

I'm not sure why, but the hlist for each entry iterators were conceived
differently from the list ones.  The list ones take three parameters:

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only do
they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
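
A userspace sketch of a three-parameter iterator is given below.  It is not
the definition from linux/list.h: the sk_* names are hypothetical, and it
relies on the __typeof__ extension provided by GCC and Clang to derive the
entry type from pos, which is what makes the separate node cursor unnecessary.

```c
#include <assert.h>
#include <stddef.h>

struct sk_hnode { struct sk_hnode *next; };
struct sk_hhead { struct sk_hnode *first; };

/* Map a node pointer back to its containing entry, or NULL at the end. */
#define sk_entry_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define sk_entry_or_null(ptr, type, member) \
	((ptr) ? sk_entry_of(ptr, type, member) : NULL)

/* Same shape as list_for_each_entry(pos, head, member): no node cursor. */
#define sk_hlist_for_each_entry(pos, head, member)			\
	for (pos = sk_entry_or_null((head)->first,			\
				    __typeof__(*(pos)), member);	\
	     pos;							\
	     pos = sk_entry_or_null((pos)->member.next,			\
				    __typeof__(*(pos)), member))

struct sk_item {
	int value;
	struct sk_hnode link;
};

static void sk_push_front(struct sk_hhead *h, struct sk_hnode *n)
{
	n->next = h->first;
	h->first = n;
}

static int sk_sum_items(struct sk_hhead *h)
{
	struct sk_item *it;
	int sum = 0;

	sk_hlist_for_each_entry(it, h, link)
		sum += it->value;
	return sum;
}
```

The loop body sees only the entry pointer, exactly as it would with the list
iterator.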

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small number of places were using the 'node' parameter; these
 were modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin 
Acked-by: Paul E. McKenney 
Signed-off-by: Sasha Levin 
Cc: Wu Fengguang 
Cc: Marcelo Tosatti 
Cc: Gleb Natapov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds