linux-stable.git/fs/jbd2, branch linux-3.4.y

jbd2: Fix unreclaimed pages after truncate in data=journal mode

2016-10-26T15:15:34+00:00

commit bc23f0c8d7ccd8d924c4e70ce311288cb3e61ea8 upstream.

Ted and Namjae have reported that truncated pages are not reclaimed in
a timely manner in data=journal mode. The following test triggers the
issue easily:

for (i = 0; i < 1000; i++) {
	pwrite(fd, buf, 1024*1024, 0);
	fsync(fd);
	fsync(fd);
	ftruncate(fd, 0);
}
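
For convenience, the loop above can be wrapped into a self-contained
helper. This is only a sketch: the function name, the path and iteration
parameters, and the error handling are assumptions added here and are not
part of the original report. It exercises the same system calls on any
file system; the slow reclaim itself only shows up on ext4 mounted with
data=journal.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for pwrite(), fsync(), ftruncate() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the write/fsync/fsync/truncate loop against
 * `path` for `iterations` rounds.  Returns 0 on success and -1 as soon
 * as any system call fails.
 */
int run_truncate_loop(const char *path, int iterations)
{
	const size_t len = 1024 * 1024;
	char *buf;
	int fd, i, ret = -1;

	assert(path != NULL);
	buf = malloc(len);
	if (!buf)
		return -1;
	memset(buf, 0xab, len);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		goto out_free;

	for (i = 0; i < iterations; i++) {
		if (pwrite(fd, buf, len, 0) != (ssize_t)len)
			goto out_close;
		if (fsync(fd) != 0 || fsync(fd) != 0)
			goto out_close;
		if (ftruncate(fd, 0) != 0)
			goto out_close;
	}
	ret = 0;

out_close:
	close(fd);
out_free:
	free(buf);
	return ret;
}
```

Calling run_truncate_loop() with a file on the affected mount and 1000
iterations corresponds to the test in the report.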

The reason is that journal_unmap_buffer() finds that the truncated
buffers are not journalled (jh->b_transaction == NULL), that they are
part of the checkpoint list of a transaction (jh->b_cp_transaction !=
NULL), and that they have already been written out (!buffer_dirty(bh)).
We clean such buffers but leave them in the checkpoint list. Since the
checkpoint transaction holds a reference to the journal head, these
buffers cannot be released until the checkpoint transaction is cleaned
up. At that point we no longer call release_buffer_page(), so pages
detached from the mapping linger in the system until reclaim finds and
frees them.

Fix the problem by removing buffers from transaction checkpoint lists
when journal_unmap_buffer() finds out they don't have to be there
anymore.
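
The three conditions can be modelled in user space roughly as follows.
This is a simplified sketch, not the kernel code: the structures are
stand-ins with only the fields mentioned in this message, the `dirty`
flag models buffer_dirty(bh), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real layouts. */
struct model_transaction {
	int tid;
};

struct model_journal_head {
	struct model_transaction *b_transaction;	/* journalling owner */
	struct model_transaction *b_cp_transaction;	/* checkpoint list owner */
	bool dirty;					/* models buffer_dirty(bh) */
};

/*
 * Returns true if the buffer was dropped from its checkpoint list,
 * i.e. it was unjournalled, on a checkpoint list, and already written.
 */
bool model_unmap_buffer(struct model_journal_head *jh)
{
	assert(jh != NULL);

	if (jh->b_transaction != NULL)
		return false;	/* still journalled: not the case modelled here */
	if (jh->b_cp_transaction == NULL)
		return false;	/* not on any checkpoint list */
	if (jh->dirty)
		return false;	/* not written out yet: must stay checkpointed */

	/* Nothing left to checkpoint, so release the checkpoint reference. */
	jh->b_cp_transaction = NULL;
	return true;
}
```

Once the checkpoint reference is gone in this model, nothing keeps the
journal head pinned, which is what allows the page to be released
promptly.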

Reported-and-tested-by: Namjae Jeon 
Fixes: de1b794130b130e77ffa975bb58cb843744f9ae5
Signed-off-by: Jan Kara 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Zefan Li

ext4, jbd2: ensure entering into panic after recording an error in superblock

2016-10-26T15:15:25+00:00

commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 upstream.

If an ext4 filesystem uses JBD2 journaling and an error occurs, the
journal is aborted first, the error number is recorded in the JBD2
superblock and, finally, the system panics if the "errors=panic" option
is set.  In rare cases, however, this sequence is reordered as shown in
the figure below, and the system panics (which means a system reset in a
mobile environment) before the error has been recorded in the journal
superblock. In this case, e2fsck cannot tell that a filesystem failure
occurred in the previous run, and the corruption is not fixed.

Task A                        Task B
ext4_handle_error()
-> jbd2_journal_abort()
  -> __journal_abort_soft()
    -> __jbd2_journal_abort_hard()
    | -> journal->j_flags |= JBD2_ABORT;
    |
    |                         __ext4_abort()
    |                         -> jbd2_journal_abort()
    |                         | -> __journal_abort_soft()
    |                         |   -> if (journal->j_flags & JBD2_ABORT)
    |                         |           return;
    |                         -> panic()
    |
    -> jbd2_journal_update_sb_errno()

Tested-by: Hobin Woo 
Signed-off-by: Daeho Jeong 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Zefan Li

jbd2: avoid infinite loop when destroying aborted journal

2015-10-22T01:20:08+00:00

commit 841df7df196237ea63233f0f9eaa41db53afd70f upstream.

Commit 6f6a6fda2945 "jbd2: fix ocfs2 corrupt when updating journal
superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
when the journal is aborted. That makes the logic in
jbd2_log_do_checkpoint() bail out, which is fine, except that
jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
progress in cleaning the journal. Without that progress,
jbd2_journal_destroy() loops forever.

Fix jbd2_journal_destroy() to clean up the journal checkpoint lists if
jbd2_log_do_checkpoint() fails with an error.

Reported-by: Eryu Guan 
Tested-by: Eryu Guan 
Fixes: 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a
Signed-off-by: Jan Kara 
Signed-off-by: Theodore Ts'o 
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li

jbd2: fix ocfs2 corrupt when updating journal superblock fails

2015-10-22T01:20:04+00:00

commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a upstream.

If updating the journal superblock fails after journal data has been
flushed, the error is dropped and the caller is misled into treating the
operation as successful.  In ocfs2, the checkpoint is then considered
successful and another node can take the lock and perform updates. Since
sb_start still points to the old log block, the other node rewrites the
stale journal data during journal recovery. The new updates are thus
overwritten and ocfs2 is corrupted.  So in this case we have to return
the error; ocfs2_commit_cache will handle it and prevent the other node
from updating first.  Only after the journal has been recovered can the
new updates be made.

The discussion of this issue can be found at:
https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html
http://comments.gmane.org/gmane.comp.file-systems.ext4/48841

[ Fixed bug in patch which allowed a non-negative error return from
  jbd2_cleanup_journal_tail() to leak out of jbd2_journal_flush(); this
  was causing xfstests ext4/306 to fail. -- Ted ]

Reported-by: Yiwen Jiang 
Signed-off-by: Joseph Qi 
Signed-off-by: Theodore Ts'o 
Tested-by: Yiwen Jiang 
Cc: Junxiao Bi 
Signed-off-by: Zefan Li

jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()

2015-10-22T01:20:04+00:00

commit b4f1afcd068f6e533230dfed00782cd8a907f96b upstream.

jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start(),
so its allocations should be done with GFP_NOFS to keep memory reclaim
from re-entering the file system.

[Full stack trace snipped from 3.10-rh7]
[] dump_stack+0x19/0x1b
[] warn_slowpath_common+0x61/0x80
[] warn_slowpath_null+0x1a/0x20
[] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17
[] kmem_cache_alloc+0x55/0x210
[] ? mempool_alloc_slab+0x15/0x20
[] mempool_alloc_slab+0x15/0x20
[] mempool_alloc+0x69/0x170
[] ? _raw_spin_unlock_irq+0xe/0x20
[] ? finish_task_switch+0x5d/0x150
[] bio_alloc_bioset+0x1be/0x2e0
[] blkdev_issue_flush+0x99/0x120
[] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL
[] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2]
[] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2]
[] start_this_handle+0x2d8/0x550 [jbd2]
[] ? __memcg_kmem_put_cache+0x29/0x30
[] ? kmem_cache_alloc+0x130/0x210
[] jbd2__journal_start+0xba/0x190 [jbd2]
[] ? lru_cache_add+0xe/0x10
[] ? ext4_da_write_begin+0xf9/0x330 [ext4]
[] __ext4_journal_start_sb+0x77/0x160 [ext4]
[] ext4_da_write_begin+0xf9/0x330 [ext4]
[] generic_file_buffered_write_iter+0x10c/0x270
[] __generic_file_write_iter+0x178/0x390
[] __generic_file_aio_write+0x8b/0xb0
[] generic_file_aio_write+0x5d/0xc0
[] ext4_file_write+0xa9/0x450 [ext4]
[] ? pipe_read+0x379/0x4f0
[] do_sync_write+0x90/0xe0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x58/0xb0
[] system_call_fastpath+0x16/0x1b

Signed-off-by: Dmitry Monakhov 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Zefan Li

ext4/jbd2: don't wait (forever) for stale tid caused by wraparound

2014-03-11T23:10:05+00:00

commit d76a3a77113db020d9bb1e894822869410450bd9 upstream.

In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions the tid number space wraps, causing
tid_geq()'s calculations to give the wrong answer.
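
The arithmetic can be illustrated with a small user-space model. This is
a sketch assuming a 32-bit unsigned tid compared via a signed difference;
the model_ names are made up for the example, and the unsigned-to-int
conversion relies on the usual two's complement behaviour of gcc and
clang.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int model_tid_t;	/* models a 32-bit transaction id */

/* True if x is strictly newer than y, within a window of 2**31 tids. */
bool model_tid_gt(model_tid_t x, model_tid_t y)
{
	int difference = (int)(x - y);	/* wraps modulo 2**32 first */

	return difference > 0;
}

/* True if x is newer than or equal to y, within the same window. */
bool model_tid_geq(model_tid_t x, model_tid_t y)
{
	int difference = (int)(x - y);

	return difference >= 0;
}

/* Convenience: does `stale` wrongly look newer than `current`? */
bool model_stale_looks_newer(model_tid_t current, model_tid_t stale)
{
	assert(current != stale);
	return model_tid_gt(stale, current);
}
```

For tids that are close together the comparison survives a wrap: tid 2
is correctly seen as newer than tid 0xFFFFFFFE. Once more than 2**31
transactions separate the two, for example a stale tid of 10 and a
current tid of 10 + 2**31 + 5, the signed difference changes sign and the
stale tid appears to lie in the future, so a waiter would wait for a
commit that is not coming any time soon.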

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tids.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" 
Reported-by: Ben Hutchings 
Reported-by: George Barnett 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings 
Cc: Rui Xiang 
Signed-off-by: Greg Kroah-Hartman

jbd2: don't BUG but return ENOSPC if a handle runs out of space

2014-01-08T17:42:12+00:00

commit f6c07cad081ba222d63623d913aafba5586c1d2c upstream.

If a handle runs out of space, we currently stop the kernel with a BUG
in jbd2_journal_dirty_metadata().  This makes it hard to figure out
what might be going on.  Instead, return ENOSPC so that the file system
layer can work out what is going on, which makes it more likely that we
get useful debugging information.  This should make it easier to debug
problems such as the one which was reported by:

    https://bugzilla.kernel.org/show_bug.cgi?id=44731

The only two callers of this function are ext4_handle_dirty_metadata()
and ocfs2_journal_dirty().  The ocfs2 function will trigger a
BUG_ON(), which means there will be no change in behavior.  The ext4
function will call ext4_error_inode() which will print the useful
debugging information and then handle the situation using ext4's error
handling mechanisms (i.e., which might mean halting the kernel or
remounting the file system read-only).

Also, since both file systems already call WARN_ON(), drop the WARN_ON
from jbd2_journal_dirty_metadata() to avoid two stack traces from
being displayed.

Signed-off-by: "Theodore Ts'o" 
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker 
Signed-off-by: Greg Kroah-Hartman

jbd2: fix theoretical race in jbd2__journal_restart

2013-07-22T01:19:00+00:00

commit 39c04153fda8c32e85b51c96eb5511a326ad7609 upstream.

Once we decrement transaction->t_updates, if this is the last handle
holding the transaction from closing, and once we release the
t_handle_lock spinlock, it's possible for the transaction to commit
and be released.  In practice with normal kernels, this probably won't
happen, since the commit happens in a separate kernel thread and it's
unlikely this could all happen within the space of a few CPU cycles.

On the other hand, with a real-time kernel, this could potentially
happen, so save the tid found in transaction->t_tid before we release
t_handle_lock.  It would require an insane configuration, such as one
where the jbd2 thread was set to a very high real-time priority,
perhaps because a high priority real-time thread is trying to read or
write to a file system.  But some people who use real-time kernels
have been known to do insane things, including controlling
laser-wielding industrial robots.  :-)

Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

jbd2: fix race between jbd2_journal_remove_checkpoint and ->j_commit_callback

2013-05-08T02:51:57+00:00

commit 794446c6946513c684d448205fbd76fa35f38b72 upstream.

The following race is possible:

[kjournald2]                              other_task
jbd2_journal_commit_transaction()
  j_state = T_FINISHED;
  spin_unlock(&journal->j_list_lock);
                                         ->jbd2_journal_remove_checkpoint()
					   ->jbd2_journal_free_transaction();
					     ->kmem_cache_free(transaction)
  ->j_commit_callback(journal, transaction);
    -> USE_AFTER_FREE

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [] warn_slowpath_common+0xad/0xf0
 [] warn_slowpath_fmt+0x46/0x50
 [] ? ext4_journal_commit_callback+0x99/0xc0
 [] __list_del_entry+0x1c0/0x250
 [] ext4_journal_commit_callback+0x6f/0xc0
 [] jbd2_journal_commit_transaction+0x23a6/0x2570
 [] ? try_to_del_timer_sync+0x82/0xa0
 [] ? del_timer_sync+0x91/0x1e0
 [] kjournald2+0x19f/0x6a0
 [] ? wake_up_bit+0x40/0x40
 [] ? bit_spin_lock+0x80/0x80
 [] kthread+0x10e/0x120
 [] ? __init_kthread_worker+0x70/0x70
 [] ret_from_fork+0x7c/0xb0
 [] ? __init_kthread_worker+0x70/0x70

To demonstrate this issue, mount ext4 with the "-o discard" option on
an SSD.  This makes the callback take longer, which widens the race
window.

To fix this, mark the transaction as finished only after the callbacks
have completed.

Signed-off-by: Dmitry Monakhov 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

jbd2: fix use after free in jbd2_journal_dirty_metadata()

2013-03-28T19:12:15+00:00

commit ad56edad089b56300fd13bb9eeb7d0424d978239 upstream.

jbd2_journal_dirty_metadata() didn't get a reference to the journal_head
it was working with. This is OK in most cases since the journal head
should be attached to a transaction, but on rare occasions when we are
journalling data, __ext4_journalled_writepage() can race with
jbd2_journal_invalidatepage() stripping buffers from a page, and thus
the journal head can be freed from under jbd2_journal_dirty_metadata().

Fix the problem by getting own journal head reference in
jbd2_journal_dirty_metadata() (and also in jbd2_journal_set_triggers()
which can possibly have the same issue).
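
The effect of taking a private reference can be shown with a small
user-space reference-count model. This is a sketch only: the structure,
the `freed` flag and the function names are hypothetical stand-ins, and
the locking that protects the count in the kernel is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a reference-counted journal head. */
struct model_jh {
	int b_jcount;	/* number of references held */
	bool freed;	/* set when the last reference is dropped */
};

/* Take a reference; returns NULL if the object is already gone. */
struct model_jh *model_grab(struct model_jh *jh)
{
	assert(jh != NULL);
	if (jh->freed)
		return NULL;
	jh->b_jcount++;
	return jh;
}

/* Drop a reference; the object is "freed" when the count hits zero. */
void model_put(struct model_jh *jh)
{
	assert(jh != NULL && jh->b_jcount > 0);
	if (--jh->b_jcount == 0)
		jh->freed = true;
}
```

If the dirtying path grabs its own reference first, a concurrent
invalidation that drops the attachment reference no longer frees the
journal head; it goes away only when the dirtying path calls
model_put() at the end.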

Reported-by: Zheng Liu 
Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman