linux-stable.git/fs/jbd2/checkpoint.c, branch linux-3.2.y

jbd2: avoid infinite loop when destroying aborted journal

2015-10-13T02:46:13+00:00

commit 841df7df196237ea63233f0f9eaa41db53afd70f upstream.

Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
when the journal is aborted. That makes logic in
jbd2_log_do_checkpoint() bail out which is fine, except that
jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
progress in cleaning the journal. Without it jbd2_journal_destroy()
loops forever.

Fix jbd2_journal_destroy() to clean up the journal checkpoint lists if
jbd2_log_do_checkpoint() fails with an error.
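
The loop described above can be sketched in user space roughly as follows.
This is only an illustrative model, not the kernel code: the struct and
every model_* name are hypothetical, and the checkpoint list is reduced to
a simple counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model; these are not the kernel's structures. */
struct model_journal {
	int  checkpoint_transactions;	/* transactions left on the checkpoint list */
	bool aborted;			/* set once the journal has been aborted */
};

/* Models jbd2_log_do_checkpoint(): makes progress unless the journal is aborted. */
static int model_log_do_checkpoint(struct model_journal *j)
{
	if (j->aborted)
		return -EIO;		/* bail out without touching the list */
	if (j->checkpoint_transactions > 0)
		j->checkpoint_transactions--;
	return 0;
}

/* Models the cleanup path: drop whatever is left on the checkpoint list. */
static void model_destroy_checkpoint(struct model_journal *j)
{
	j->checkpoint_transactions = 0;
}

/* Returns the number of loop iterations so a caller can see that it terminates. */
static int model_journal_destroy(struct model_journal *j)
{
	int iterations = 0;

	while (j->checkpoint_transactions > 0) {
		iterations++;
		if (model_log_do_checkpoint(j) != 0) {
			/* Without this branch an aborted journal would spin forever. */
			model_destroy_checkpoint(j);
			break;
		}
	}
	assert(j->checkpoint_transactions == 0);
	return iterations;
}
```

With a healthy journal holding three transactions the model loop runs three
times; with an aborted one it runs once and then empties the list itself.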

Reported-by: Eryu Guan 
Tested-by: Eryu Guan 
Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
Signed-off-by: Jan Kara 
Signed-off-by: Theodore Ts'o 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings 
Cc: Roland Dreier

jbd2: fix ocfs2 corrupt when updating journal superblock fails

2015-08-12T14:33:15+00:00

commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a upstream.

If updating the journal superblock fails after journal data has been
flushed, the error is ignored and the caller is misled into treating it
as a normal case.  In ocfs2, the checkpoint is then treated as successful
and the other node can get the lock to update. Since sb_start still
points to the old log block, the other node rewrites the journal data
during journal recovery. Thus the new updates are overwritten and ocfs2
is corrupted.  So in the above case we have to return the error, and
ocfs2_commit_cache will take care of the error and prevent the other
node from updating first.  Only after recovering the journal can it do
the new updates.

The mailing list discussion of this issue can be found at:
https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

[ Fixed bug in patch which allowed a non-negative error return from
  jbd2_cleanup_journal_tail() to leak out of jbd2_journal_flush(); this
  was causing xfstests ext4/306 to fail. -- Ted ]
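
A minimal user-space sketch of the return-value handling discussed above.
The convention assumed here (negative for an error, 0 for progress, 1 for
"nothing to clean up") and all model_* names are assumptions made for
illustration; they are not copied from the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inputs that drive the model. */
struct model_sb_state {
	bool tail_advanced;	/* is there anything to clean from the log? */
	bool sb_write_fails;	/* simulate a failing superblock write */
};

/* Assumed convention: <0 error, 0 progress made, 1 nothing to do. */
static int model_cleanup_journal_tail(const struct model_sb_state *s)
{
	if (!s->tail_advanced)
		return 1;
	if (s->sb_write_fails)
		return -EIO;	/* the error must reach the caller */
	return 0;
}

/* Models a flush path: propagate real errors, hide the positive "no-op" code. */
static int model_journal_flush(const struct model_sb_state *s)
{
	int err = model_cleanup_journal_tail(s);

	if (err < 0)
		return err;
	return 0;		/* a positive value is not an error */
}
```

The point of the second function is that callers of the flush path only ever
see 0 or a negative errno, never the internal "nothing to do" value.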

Reported-by: Yiwen Jiang 
Signed-off-by: Joseph Qi 
Signed-off-by: Theodore Ts'o 
Tested-by: Yiwen Jiang 
Cc: Junxiao Bi 
[bwh: Backported to 3.2:
 - Adjust context
 - Don't drop j_checkpoint_mutex where we don't hold it]
Signed-off-by: Ben Hutchings

jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()

2015-08-12T14:33:15+00:00

commit b4f1afcd068f6e533230dfed00782cd8a907f96b upstream.

jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start(),
so allocations should be done with GFP_NOFS.

[Full stack trace snipped from 3.10-rh7]
[] dump_stack+0x19/0x1b
[] warn_slowpath_common+0x61/0x80
[] warn_slowpath_null+0x1a/0x20
[] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
[] kmem_cache_alloc+0x55/0x210
[] ? mempool_alloc_slab+0x15/0x20
[] mempool_alloc_slab+0x15/0x20
[] mempool_alloc+0x69/0x170
[] ? _raw_spin_unlock_irq+0xe/0x20
[] ? finish_task_switch+0x5d/0x150
[] bio_alloc_bioset+0x1be/0x2e0
[] blkdev_issue_flush+0x99/0x120
[] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
[] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
[] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
[] start_this_handle+0x2d8/0x550 [jbd2]
[] ? __memcg_kmem_put_cache+0x29/0x30
[] ? kmem_cache_alloc+0x130/0x210
[] jbd2__journal_start+0xba/0x190 [jbd2]
[] ? lru_cache_add+0xe/0x10
[] ? ext4_da_write_begin+0xf9/0x330 [ext4]
[] __ext4_journal_start_sb+0x77/0x160 [ext4]
[] ext4_da_write_begin+0xf9/0x330 [ext4]
[] generic_file_buffered_write_iter+0x10c/0x270
[] __generic_file_write_iter+0x178/0x390
[] __generic_file_aio_write+0x8b/0xb0
[] generic_file_aio_write+0x5d/0xc0
[] ext4_file_write+0xa9/0x450 [ext4]
[] ? pipe_read+0x379/0x4f0
[] do_sync_write+0x90/0xe0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x58/0xb0
[] system_call_fastpath+0x16/0x1b

Signed-off-by: Dmitry Monakhov 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Ben Hutchings

jbd2: issue cache flush after checkpointing even with internal journal

2015-08-12T14:33:15+00:00

commit 79feb521a44705262d15cc819a4117a447b11ea7 upstream.

When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
checkpointed buffers are on stable storage - especially if the buffers were
written out by jbd2_log_do_checkpoint(), they are likely to be only in the
disk's caches. Thus when we update the journal superblock, effectively removing
the old transaction from the journal, this superblock write can reach stable
storage before those checkpointed buffers, which can result in filesystem
corruption after a crash. Thus we must unconditionally issue a cache flush
before we update the journal superblock in these cases.

A similar problem can also occur if the journal superblock is written only to
the disk's caches, another transaction starts reusing the space of the
transaction cleaned from the log, and a power failure happens. A subsequent
journal replay would still try to replay the old transaction, but some of its
blocks may already be overwritten by the new transaction. For this reason we
must use WRITE_FUA when updating the log tail, and we must first write the new
log tail to disk and update the in-memory information only after that.
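
The required ordering (cache flush, then a FUA write of the superblock, then
the in-memory update) can be modelled in user space along these lines. The
disk model and all model_* names are hypothetical; the real code works on
buffer heads and the block layer, which this sketch does not attempt to
reproduce.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical disk with a volatile write cache. */
struct model_disk {
	int      cached_blocks;	/* checkpointed blocks still only in the cache */
	int      stable_blocks;	/* checkpointed blocks on stable storage */
	unsigned stable_tail;	/* log tail recorded in the on-disk superblock */
};

struct model_journal {
	struct model_disk disk;
	unsigned          tail;	/* in-memory log tail */
};

/* Step 1: push everything in the volatile cache to stable storage. */
static void model_cache_flush(struct model_disk *d)
{
	d->stable_blocks += d->cached_blocks;
	d->cached_blocks = 0;
}

/* Step 2: a FUA-style write goes straight to stable storage, or fails. */
static bool model_write_sb_fua(struct model_disk *d, unsigned new_tail, bool fail)
{
	if (fail)
		return false;
	d->stable_tail = new_tail;
	return true;
}

static int model_update_log_tail(struct model_journal *j, unsigned new_tail,
				 bool sb_write_fails)
{
	model_cache_flush(&j->disk);
	assert(j->disk.cached_blocks == 0);	/* data is stable before the tail moves */

	if (!model_write_sb_fua(&j->disk, new_tail, sb_write_fails))
		return -EIO;			/* in-memory tail stays untouched */

	j->tail = new_tail;			/* step 3: only after the disk agrees */
	return 0;
}
```

In this model the in-memory tail can never run ahead of the on-disk tail, and
the on-disk tail can never run ahead of the checkpointed data.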

Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o" 
[bwh: Prerequisite for "jbd2: fix ocfs2 corrupt when updating journal
 superblock fails".
 Backported to 3.2:
 - Adjust context
 - Drop changes to jbd2_journal_update_sb_log_tail trace event]
Signed-off-by: Ben Hutchings

jbd2: split updating of journal superblock and marking journal empty

2015-08-12T14:33:15+00:00

commit 24bcc89c7e7c64982e6192b4952a0a92379fc341 upstream.

There are three cases of updating the journal superblock. In the first case, we
want to mark the journal as empty (setting s_sequence to 0), in the second case
we want to update the log tail, and in the third case we want to update s_errno.
Split these cases into separate functions. It makes the code slightly more
straightforward and later patches will make the distinction even more important.

Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o" 
[bwh: Prerequisite for "jbd2: fix ocfs2 corrupt when updating journal
 superblock fails".
 Backported to 3.2: drop changes to trace events.]
Signed-off-by: Ben Hutchings

jbd2: use WRITE_SYNC in journal checkpoint

2011-06-27T16:36:29+00:00

In journal checkpoint, we write the buffer and wait for the write to finish.
But in cfq, the async queue has a very low priority, and in our test,
if there are too many sync queues and every queue is filled up with
requests, the write request will be delayed for quite a long time and
all the tasks which are waiting for journal space will end with errors like:

INFO: task attr_set:3816 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
attr_set      D ffff880028393480     0  3816      1 0x00000000
 ffff8802073fbae8 0000000000000086 ffff8802140847c8 ffff8800283934e8
 ffff8802073fb9d8 ffffffff8103e456 ffff8802140847b8 ffff8801ed728080
 ffff8801db4bc080 ffff8801ed728450 ffff880028393480 0000000000000002
Call Trace:
 [] ? __dequeue_entity+0x33/0x38
 [] ? need_resched+0x23/0x2d
 [] ? thread_return+0xa2/0xbc
 [] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
 [] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
 [] __mutex_lock_common+0x14e/0x1a9
 [] ? brelse+0x13/0x15 [ext4]
 [] __mutex_lock_slowpath+0x19/0x1b
 [] mutex_lock+0x1b/0x32
 [] __jbd2_journal_insert_checkpoint+0xe3/0x20c [jbd2]
 [] start_this_handle+0x438/0x527 [jbd2]
 [] ? autoremove_wake_function+0x0/0x3e
 [] jbd2_journal_start+0xa1/0xcc [jbd2]
 [] ext4_journal_start_sb+0x57/0x81 [ext4]
 [] ext4_xattr_set+0x6c/0xe3 [ext4]
 [] ext4_xattr_user_set+0x42/0x4b [ext4]
 [] generic_setxattr+0x6b/0x76
 [] __vfs_setxattr_noperm+0x47/0xc0
 [] vfs_setxattr+0x7f/0x9a
 [] setxattr+0xb5/0xe8
 [] ? do_filp_open+0x571/0xa6e
 [] sys_fsetxattr+0x6b/0x91
 [] system_call_fastpath+0x16/0x1b

So this patch uses WRITE_SYNC in __flush_batch so that the request will
be moved into the sync queue and handled by cfq in a timely manner. We also use
the new plug, so that all the WRITE_SYNC requests can be submitted as a whole
when we unplug it.

Signed-off-by: Tao Ma 
Signed-off-by: "Theodore Ts'o" 
Cc: Jan Kara 
Reported-by: Robin Dong

jbd2: Fix oops in jbd2_journal_remove_journal_head()

2011-06-13T19:38:22+00:00

jbd2_journal_remove_journal_head() can oops when trying to access
journal_head returned by bh2jh(). This is caused for example by the
following race:

	TASK1					TASK2
  jbd2_journal_commit_transaction()
    ...
    processing t_forget list
      __jbd2_journal_refile_buffer(jh);
      if (!jh->b_transaction) {
        jbd_unlock_bh_state(bh);
					jbd2_journal_try_to_free_buffers()
					  jbd2_journal_grab_journal_head(bh)
					  jbd_lock_bh_state(bh)
					  __journal_try_to_free_buffer()
					  jbd2_journal_put_journal_head(jh)
        jbd2_journal_remove_journal_head(bh);

jbd2_journal_put_journal_head() in TASK2 sees that b_jcount == 0 and
the buffer is not part of any transaction and thus frees the journal_head
before TASK1 gets to doing so. Note that even the buffer_head can be
released by try_to_free_buffers() after
jbd2_journal_put_journal_head(), which adds an even larger opportunity
for an oops (but I didn't see this happen in reality).

Fix the problem by making transactions hold their own journal_head
reference (in b_jcount). That way we don't have to remove the journal_head
explicitly via jbd2_journal_remove_journal_head() and instead just
remove the journal_head when b_jcount drops to zero. The result of this is
that [__]jbd2_journal_refile_buffer(),
[__]jbd2_journal_unfile_buffer(), and
__jbd2_journal_remove_checkpoint() can free the journal_head, which needs
modification of a few callers. Also we have to be careful because once
the journal_head is removed, the buffer_head might be freed as well. So we
have to get our own buffer_head reference where it matters.
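
The reference-counting scheme can be illustrated with a small user-space
model. It is a sketch under simplifying assumptions: there is no locking,
"freeing" is represented by a flag so the state can still be inspected, and
all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a journal_head. */
struct model_journal_head {
	int  b_jcount;		/* reference count */
	bool on_transaction;	/* true while filed on a transaction list */
	bool freed;		/* stands in for actually releasing the memory */
};

static void model_get(struct model_journal_head *jh)
{
	assert(!jh->freed);
	jh->b_jcount++;
}

/* The last put is the only place where the journal_head goes away. */
static void model_put(struct model_journal_head *jh)
{
	assert(!jh->freed && jh->b_jcount > 0);
	if (--jh->b_jcount == 0)
		jh->freed = true;
}

/* Filing on a transaction takes the transaction's own reference. */
static void model_file_buffer(struct model_journal_head *jh)
{
	model_get(jh);
	jh->on_transaction = true;
}

/* Unfiling drops that reference, so this call may free the journal_head. */
static void model_unfile_buffer(struct model_journal_head *jh)
{
	assert(jh->on_transaction);
	jh->on_transaction = false;
	model_put(jh);
}
```

Because the transaction holds a counted reference, a concurrent get/put pair
from another task cannot free the journal_head underneath it; whoever drops
the last reference does the freeing, and nobody needs an explicit remove step.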

Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o"

Merge branch 'next' into upstream-merge

2010-10-28T03:44:47+00:00

Conflicts:
	fs/ext4/inode.c
	fs/ext4/mballoc.c
	include/trace/events/ext4.h

jbd2: Add sanity check for attempts to start handle during umount

2010-10-28T01:30:04+00:00

An attempt to modify the file system during the call to
jbd2_journal_destroy() can lead to a system lockup.  So add some
checking to make it much more obvious when this happens and to
determine where the offending code is located.
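
One way to picture such a check is a guard at the point where a handle is
started. The sketch below is a user-space illustration only: the flag, the
counter, the choice of -EROFS and the model_* names are assumptions, and the
kernel may well assert instead of returning an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical journal state. */
struct model_journal {
	bool being_destroyed;	/* set for the duration of the destroy call */
	int  rejected_starts;	/* how many offending callers were caught */
};

/* Refuse to start a handle once teardown has begun, and record the attempt. */
static int model_start_handle(struct model_journal *j)
{
	if (j->being_destroyed) {
		j->rejected_starts++;	/* a real check would also report the caller */
		return -EROFS;		/* assumed error code for this sketch */
	}
	return 0;
}
```

Failing loudly at the start of the handle points at the offending caller,
instead of leaving a silent lockup to be debugged later.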

Signed-off-by: "Theodore Ts'o"

block: remove BLKDEV_IFL_WAIT

2010-09-16T18:52:58+00:00

All the blkdev_issue_* helpers can only sanely be used by synchronous
callers.  To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Jens Axboe