linux-stable.git/fs/ext4/inode.c, branch v3.2.78

ext4: fix race between truncate and __ext4_journalled_writepage()

2015-08-12T14:33:15+00:00

commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream.

The commit cf108bca465d: "ext4: Invert the locking order of page_lock
and transaction start" caused __ext4_journalled_writepage() to drop
the page lock before the page was written back, as part of changing
the locking order to jbd2_journal_start -> page_lock.  However, this
introduced a potential race if there was a truncate racing with the
data=journalled writeback mode.

Fix this by grabbing the page lock after starting the journal handle,
and then checking whether the page was truncated out from under us
in the meantime.
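
A minimal userspace sketch of the recheck described above, not the kernel
code itself. The types struct page and struct mapping, the hook parameter,
and every model_* name are hypothetical stand-ins; the only point is the
ordering: sample the mapping under the page lock, drop the lock, start the
handle, retake the lock, and compare.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's address_space and page. */
struct mapping { int id; };
struct page { struct mapping *mapping; int locked; int refcount; };

enum wb_result { WB_SKIPPED_TRUNCATED = 0, WB_WRITTEN = 1 };

/* Hook that runs while the page lock is dropped (models a racing task). */
typedef void (*race_hook)(struct page *);

/* Models truncate detaching the page from its mapping. */
void model_truncate(struct page *page)
{
	page->mapping = NULL;
}

enum wb_result model_journalled_writepage(struct page *page, race_hook hook)
{
	struct mapping *mapping;
	enum wb_result ret;

	assert(page->locked);		/* caller holds the page lock */
	mapping = page->mapping;	/* sampled while still locked */

	page->refcount++;		/* keep the page alive while unlocked */
	page->locked = 0;		/* drop the lock before starting the handle */

	if (hook)
		hook(page);		/* window in which truncate may run */

	/* The journal handle would be started here. */

	page->locked = 1;		/* retake the page lock */
	page->refcount--;
	if (page->mapping != mapping)
		ret = WB_SKIPPED_TRUNCATED;	/* truncated under us: do nothing */
	else
		ret = WB_WRITTEN;
	page->locked = 0;
	return ret;
}
```

Calling model_journalled_writepage(&p, model_truncate) on a locked page
takes the skip path, while passing NULL as the hook takes the write path.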

This fixes a number of different warnings or BUG_ON's when running
xfstests generic/086 in data=journalled mode, including:

jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0

	      	      	  - and -

kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
    ...
Call Trace:
 [] ? __ext4_journalled_invalidatepage+0x117/0x117
 [] __ext4_journalled_invalidatepage+0x10f/0x117
 [] ? __ext4_journalled_invalidatepage+0x117/0x117
 [] ? lock_buffer+0x36/0x36
 [] ext4_journalled_invalidatepage+0xd/0x22
 [] do_invalidatepage+0x22/0x26
 [] truncate_inode_page+0x5b/0x85
 [] truncate_inode_pages_range+0x156/0x38c
 [] truncate_inode_pages+0x11/0x15
 [] truncate_pagecache+0x55/0x71
 [] ext4_setattr+0x4a9/0x560
 [] ? current_kernel_time+0x10/0x44
 [] notify_change+0x1c7/0x2be
 [] do_truncate+0x65/0x85
 [] ? file_ra_state_init+0x12/0x29

	      	      	  - and -

WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
irty_metadata+0x14a/0x1ae()
    ...
Call Trace:
 [] ? console_unlock+0x3a1/0x3ce
 [] dump_stack+0x48/0x60
 [] warn_slowpath_common+0x89/0xa0
 [] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
 [] warn_slowpath_null+0x14/0x18
 [] jbd2_journal_dirty_metadata+0x14a/0x1ae
 [] __ext4_handle_dirty_metadata+0xd4/0x19d
 [] write_end_fn+0x40/0x53
 [] ext4_walk_page_buffers+0x4e/0x6a
 [] ext4_writepage+0x354/0x3b8
 [] ? mpage_release_unused_pages+0xd4/0xd4
 [] ? wait_on_buffer+0x2c/0x2c
 [] ? ext4_writepage+0x3b8/0x3b8
 [] __writepage+0x10/0x2e
 [] write_cache_pages+0x22d/0x32c
 [] ? ext4_writepage+0x3b8/0x3b8
 [] ext4_writepages+0x102/0x607
 [] ? sched_clock_local+0x10/0x10e
 [] ? __lock_is_held+0x2e/0x44
 [] ? lock_is_held+0x43/0x51
 [] do_writepages+0x1c/0x29
 [] __writeback_single_inode+0xc3/0x545
 [] writeback_sb_inodes+0x21f/0x36d
    ...

Signed-off-by: Theodore Ts'o 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

ext4: fix reservation overflow in ext4_da_write_begin

2014-12-14T16:23:48+00:00

commit 0ff8947fc5f700172b37cbca811a38eb9cb81e08 upstream.

Delalloc write journal reservations only reserve 1 credit,
to update the inode if necessary.  However, it may happen
once in a filesystem's lifetime that a file will cross
the 2G threshold, and require the LARGE_FILE feature to
be set in the superblock as well, if it was not set already.

This overruns the transaction reservation, and can be
demonstrated simply on any ext4 filesystem without the LARGE_FILE
feature already set:

dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \
	conv=notrunc
sync
dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \
	conv=notrunc

leads to:

EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super
EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28
EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem
EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28
EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28

Adjust the number of credits based on whether the flag is
already set, and whether the current write may extend past the
LARGE_FILE limit.
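
The credit adjustment can be sketched in userspace as follows. This is an
illustration under assumptions, not the patch itself: the function name
da_write_credits, its parameters, and the LARGE_FILE_LIMIT macro are
made up for the example, and the feature flag is passed in as a plain
boolean instead of being read from the superblock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest file size that does not need LARGE_FILE (2G - 1). */
#define LARGE_FILE_LIMIT 0x7fffffffULL

/*
 * Return the number of journal credits a delalloc write should reserve:
 * 1 for the inode, plus 1 for the superblock if this write may be the
 * one that has to set the LARGE_FILE feature.
 */
int da_write_credits(bool has_large_file, uint64_t pos, unsigned int len)
{
	if (has_large_file)
		return 1;	/* feature already set, superblock untouched */
	if (pos + len <= LARGE_FILE_LIMIT)
		return 1;	/* write ends at or below the limit */
	return 2;		/* may need to update the superblock too */
}
```

With the reproducer above, the first dd (pos 2147483646, len 1) still
needs one credit, and the second (pos 2147483647, len 1) needs two.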

Signed-off-by: Eric Sandeen 
Signed-off-by: Theodore Ts'o 
Reviewed-by: Andreas Dilger 
[bwh: Backported to 3.2:
 - ext4_journal_start() doesn't have a type parameter
 - Adjust context]
Signed-off-by: Ben Hutchings

ext4: add ext4_iget_normal() which is to be used for dir tree lookups

2014-12-14T16:23:47+00:00

commit f4bb2981024fc91b23b4d09a8817c415396dbabb upstream.

If there is a corrupted file system which has directory entries that
point at reserved, metadata inodes, prohibit them from being used by
treating them the same way we treat Boot Loader inodes --- that is,
mark them to be bad inodes.  This prohibits them from being opened,
deleted, or modified via chmod, chown, utimes, etc.

In particular, this prevents a corrupted file system which has a
directory entry which points at the journal inode from being deleted
and its blocks released, after which point Much Hilarity Ensues.
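
A hedged sketch of the kind of check a directory-lookup wrapper can make
before handing out an inode. The function name, the first_ino parameter
(standing in for the first non-reserved inode number recorded in the
superblock), and the MODEL_ROOT_INO macro are assumptions for this
example; only the idea of rejecting reserved inode numbers other than the
root directory is taken from the description above.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_ROOT_INO 2U	/* root directory inode number */

/*
 * Return 0 if a directory entry may legitimately point at 'ino',
 * or -EIO if 'ino' is a reserved (metadata) inode.
 */
int model_iget_normal_check(uint32_t ino, uint32_t first_ino)
{
	if (ino < first_ino && ino != MODEL_ROOT_INO)
		return -EIO;	/* reserved inode reached via the dir tree */
	return 0;
}
```

For a file system whose first non-reserved inode is 11, inode 8 (the
journal) and inode 5 (the boot loader) are rejected, while inode 2 and
inode 11 onward pass.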

Reported-by: Sami Liedes 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Ben Hutchings

ext4: don't orphan or truncate the boot loader inode

2014-12-14T16:23:46+00:00

commit e2bfb088fac03c0f621886a04cffc7faa2b49b1d upstream.

The boot loader inode (inode #5) should never be visible in the
directory hierarchy, but it's possible if the file system is corrupted
that there will be a directory entry that points at inode #5.  In
order to avoid accidentally trashing it, when such a directory inode
is opened, the inode will be marked as a bad inode, so that it's not
possible to modify (or read) the inode from userspace.

Unfortunately, when we unlink this (invalid/illegal) directory entry,
we will put the bad inode on the orphan list, and then when we try to
unlink the directory, we don't actually remove the bad inode from the
orphan list before freeing the in-memory inode structure.  This means the
in-memory orphan list is corrupted, leading to a kernel oops.

In addition, avoid truncating a bad inode in ext4_destroy_inode(),
since truncating the boot loader inode is not a smart thing to do.

Reported-by: Sami Liedes 
Reviewed-by: Jan Kara 
Signed-off-by: Theodore Ts'o 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

ext4: FIBMAP ioctl causes BUG_ON due to handle EXT_MAX_BLOCKS

2014-05-18T13:58:02+00:00

commit 4adb6ab3e0fa71363a5ef229544b2d17de6600d7 upstream.

When we try to map logical block 2^32-1 of a file which has the extent
(ee_block=2^32-2, ee_len=1) with the FIBMAP ioctl, it causes a BUG_ON
in ext4_ext_put_gap_in_cache().

To avoid the problem, ext4_map_blocks() needs to check the file logical block
number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot
handle 2^32-1 because the maximum file logical block number is 2^32-2.

Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid.
So ext4_map_blocks() should also return the same errno.
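
The range check can be illustrated with a small userspace model. The
function name model_check_lblk and the MODEL_EXT_MAX_BLOCKS macro are
invented for this sketch; the value 2^32-1 follows from the statement
above that 2^32-2 is the largest valid logical block number.

```c
#include <errno.h>
#include <stdint.h>

/* One past the largest valid file logical block number (2^32-2). */
#define MODEL_EXT_MAX_BLOCKS 0xffffffffU

/* Return 0 if 'lblk' can be mapped, or -EIO if it is out of range. */
int model_check_lblk(uint32_t lblk)
{
	if (lblk >= MODEL_EXT_MAX_BLOCKS)
		return -EIO;	/* same errno as the indirect-block path */
	return 0;
}
```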

Signed-off-by: Kazuya Mio 
Signed-off-by: "Theodore Ts'o" 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

ext4: atomically set inode->i_flags in ext4_set_inode_flags()

2014-04-09T01:20:44+00:00

commit 00a1a053ebe5febcfc2ec498bd894f035ad2aa06 upstream.

Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time.
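
The idea can be sketched in userspace with C11 atomics standing in for
the kernel's cmpxchg(). All MODEL_* flag values and the function name are
hypothetical; the sketch only shows that the new flag word is computed
from a snapshot and published in one step, so no reader ever sees the
immutable bit transiently cleared.

```c
#include <stdatomic.h>

/* Hypothetical in-memory (VFS-side) flag bits. */
#define MODEL_S_APPEND		0x1u
#define MODEL_S_IMMUTABLE	0x2u
#define MODEL_S_OTHER		0x100u	/* a bit this function must not touch */

/* Hypothetical on-disk (ext4-side) flag bits. */
#define MODEL_EXT4_IMMUTABLE_FL	0x10u
#define MODEL_EXT4_APPEND_FL	0x20u

void model_set_inode_flags(_Atomic unsigned int *i_flags,
			   unsigned int ext4_flags)
{
	unsigned int old_fl = atomic_load(i_flags);
	unsigned int new_fl;

	do {
		/* Build the complete new value from the snapshot. */
		new_fl = old_fl & ~(MODEL_S_APPEND | MODEL_S_IMMUTABLE);
		if (ext4_flags & MODEL_EXT4_APPEND_FL)
			new_fl |= MODEL_S_APPEND;
		if (ext4_flags & MODEL_EXT4_IMMUTABLE_FL)
			new_fl |= MODEL_S_IMMUTABLE;
		/* On failure old_fl is refreshed and the loop retries. */
	} while (!atomic_compare_exchange_weak(i_flags, &old_fl, new_fl));
}
```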

Reported-by: John Sullivan 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

ext4: fix overflow when counting used blocks on 32-bit architectures

2013-07-27T04:34:34+00:00

commit 8af8eecc1331dbf5e8c662022272cf667e213da5 upstream.

The arithmetic adding delalloc blocks to the number of used blocks in
ext4_getattr() can easily overflow on 32-bit archs, as we first multiply
the number of blocks by the blocksize and then divide back by 512. Do the
arithmetic without the oversized intermediate value and also use the
proper type (unsigned long long instead of unsigned long).
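
The overflow is easy to reproduce in userspace by modelling a 32-bit
unsigned long as uint32_t. Both function names are made up for this
sketch, and the "fixed" variant is one plausible way to avoid the large
intermediate value (shift by the block size bits minus 9 in a 64-bit
type); it assumes blocksize_bits is at least 9.

```c
#include <stdint.h>

/* Old scheme: multiply by the block size first, then divide by 512. */
uint32_t sectors_old_32bit(uint32_t delalloc_blocks,
			   unsigned int blocksize_bits)
{
	/* The shift wraps modulo 2^32 before the division happens. */
	return (uint32_t)(delalloc_blocks << blocksize_bits) >> 9;
}

/* Overflow-safe scheme: 64-bit type and a single combined shift. */
uint64_t sectors_fixed(uint64_t delalloc_blocks,
		       unsigned int blocksize_bits)
{
	return delalloc_blocks << (blocksize_bits - 9);
}
```

With 2^20 delalloc blocks of 4096 bytes (4 GiB of pending data), the old
computation wraps to 0 sectors while the fixed one yields 8388608.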

Signed-off-by: Jan Kara 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Ben Hutchings

ext4/jbd2: don't wait (forever) for stale tid caused by wraparound

2013-05-13T14:02:14+00:00

commit d76a3a77113db020d9bb1e894822869410450bd9 upstream.

In the case where an inode has a very stale transaction id (tid) in
i_datasync_tid or i_sync_tid, it's possible that after a very large
(2**31) number of transactions, the tid number space might wrap,
causing tid_geq()'s calculations to fail.
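
The failure mode can be shown with a userspace model of a wraparound-aware
tid comparison. The name model_tid_geq and the 32-bit tid_t typedef are
assumptions of this sketch; it relies on the usual two's-complement
conversion when the unsigned difference is reinterpreted as signed.

```c
#include <stdint.h>

typedef uint32_t tid_t;

/* Return 1 if x is at or after y in sequence order, else 0. */
int model_tid_geq(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference >= 0;
}
```

The comparison is only meaningful while the two tids are less than 2**31
apart. Once the current tid has advanced more than 2**31 past a stale
one, model_tid_geq(current, stale) returns 0 and the stale tid looks as
if it were still in the future.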

Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
attempted to fix this problem, but it only avoided kjournald spinning
forever by fixing the logic in jbd2_log_start_commit().

Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
that might call jbd2_log_start_commit() with a stale tid, those
functions will subsequently call jbd2_log_wait_commit() with the same
stale tid, and then wait for a very long time.  To fix this, we
replace the calls to jbd2_log_start_commit() and
jbd2_log_wait_commit() with a call to a new function,
jbd2_complete_transaction(), which will correctly handle stale tid's.

As a bonus, jbd2_complete_transaction() will avoid locking
j_state_lock for writing unless a commit needs to be started.  This
should have a small (but probably not measurable) improvement for
ext4's scalability.

Signed-off-by: "Theodore Ts'o" 
Reported-by: Ben Hutchings 
Reported-by: George Barnett 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

ext4: fix data=journal fast mount/umount hang

2013-03-27T02:41:18+00:00

commit 2b405bfa84063bfa35621d2d6879f52693c614b0 upstream.

In data=journal mode, if we unmount the file system before a
transaction has a chance to complete, when the journal inode is being
evicted, we can end up calling into jbd2_log_wait_commit() for the
last transaction, after the journalling machinery has been shut down.

Arguably we should adjust ext4_should_journal_data() to return FALSE
for the journal inode, but the only place it matters is
ext4_evict_inode(), and so to save a bit of CPU time, and to make the
patch much more obviously correct by inspection(tm), we'll fix it by
explicitly not trying to wait for a journal commit when we are
evicting the journal inode, since it's guaranteed to never succeed in
this case.
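
As a rough illustration of the condition being tightened, assuming the
decision can be reduced to three inputs. The function name, its
parameters, and the MODEL_JOURNAL_INO macro are hypothetical; in the
kernel the first two inputs would come from the inode's mount options
and mode.

```c
#include <stdbool.h>

#define MODEL_JOURNAL_INO 8UL	/* reserved inode number of the journal */

/*
 * Decide whether eviction should wait for the inode's last transaction
 * to commit. The journal inode is excluded because the journalling
 * machinery is already shut down when it is evicted.
 */
bool model_should_wait_for_commit(bool journals_data, bool is_reg_or_lnk,
				  unsigned long ino)
{
	return journals_data && is_reg_or_lnk && ino != MODEL_JOURNAL_INO;
}
```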

This can be easily replicated via:

     mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb

------------[ cut here ]------------
WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd()
Hardware name: Bochs
JBD2: bad log_start_commit: 3005630206 3005630206 0 0
Modules linked in:
Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020
Call Trace:
 [] warn_slowpath_common+0x68/0x7d
 [] ? __jbd2_log_start_commit+0xba/0xcd
 [] warn_slowpath_fmt+0x2b/0x2f
 [] __jbd2_log_start_commit+0xba/0xcd
 [] jbd2_log_start_commit+0x24/0x34
 [] ext4_evict_inode+0x71/0x2e3
 [] evict+0x94/0x135
 [] iput+0x10a/0x110
 [] jbd2_journal_destroy+0x190/0x1ce
 [] ? bit_waitqueue+0x50/0x50
 [] ext4_put_super+0x52/0x294
 [] generic_shutdown_super+0x48/0xb4
 [] kill_block_super+0x22/0x60
 [] deactivate_locked_super+0x22/0x49
 [] deactivate_super+0x30/0x33
 [] mntput_no_expire+0x107/0x10c
 [] sys_umount+0x2cf/0x2e0
 [] sys_oldumount+0x12/0x14
 [] syscall_call+0x7/0xb
---[ end trace 6a954cc790501c1f ]---
jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0

Signed-off-by: "Theodore Ts'o" 
Reviewed-by: Jan Kara 
Signed-off-by: Ben Hutchings

ext4: fix possible use-after-free with AIO

2013-03-06T03:23:45+00:00

commit 091e26dfc156aeb3b73bc5c5f277e433ad39331c upstream.

A running AIO pins the inode in memory via a file reference. Once the
AIO is completed using aio_complete(), the file reference is put and the
inode can be freed from memory. So we have to be sure that calling
aio_complete() is the last thing we do with the inode.

Reviewed-by: Carlos Maiolino 
Acked-by: Jeff Moyer 
Signed-off-by: Jan Kara 
Signed-off-by: "Theodore Ts'o" 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings