linux-stable.git/fs/ext4, branch v3.2.43

ext4: fix data=journal fast mount/umount hang

2013-03-27T02:41:18+00:00

commit 2b405bfa84063bfa35621d2d6879f52693c614b0 upstream.

In data=journal mode, if we unmount the file system before a
transaction has a chance to complete, when the journal inode is being
evicted, we can end up calling into jbd2_log_wait_commit() for the
last transaction, after the journalling machinery has been shut down.

Arguably we should adjust ext4_should_journal_data() to return FALSE
for the journal inode, but the only place it matters is
ext4_evict_inode(), and so to save a bit of CPU time, and to make the
patch much more obviously correct by inspection(tm), we'll fix it by
explicitly not trying to waiting for a journal commit when we are
evicting the journal inode, since it's guaranteed to never succeed in
this case.

This can be easily replicated via:

     mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb

------------[ cut here ]------------
WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd()
Hardware name: Bochs
JBD2: bad log_start_commit: 3005630206 3005630206 0 0
Modules linked in:
Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020
Call Trace:
 [] warn_slowpath_common+0x68/0x7d
 [] ? __jbd2_log_start_commit+0xba/0xcd
 [] warn_slowpath_fmt+0x2b/0x2f
 [] __jbd2_log_start_commit+0xba/0xcd
 [] jbd2_log_start_commit+0x24/0x34
 [] ext4_evict_inode+0x71/0x2e3
 [] evict+0x94/0x135
 [] iput+0x10a/0x110
 [] jbd2_journal_destroy+0x190/0x1ce
 [] ? bit_waitqueue+0x50/0x50
 [] ext4_put_super+0x52/0x294
 [] generic_shutdown_super+0x48/0xb4
 [] kill_block_super+0x22/0x60
 [] deactivate_locked_super+0x22/0x49
 [] deactivate_super+0x30/0x33
 [] mntput_no_expire+0x107/0x10c
 [] sys_umount+0x2cf/0x2e0
 [] sys_oldumount+0x12/0x14
 [] syscall_call+0x7/0xb
---[ end trace 6a954cc790501c1f ]---
jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0

Signed-off-by: "Theodore Ts'o" 
Reviewed-by: Jan Kara 
Signed-off-by: Ben Hutchings

ext4: use atomic64_t for the per-flexbg free_clusters count

2013-03-27T02:41:12+00:00

commit 90ba983f6889e65a3b506b30dc606aa9d1d46cd2 upstream.

A user who was using a 8TB+ file system and with a very large flexbg
size (> 65536) could cause the atomic_t used in the struct flex_groups
to overflow.  This was detected by PaX security patchset:

http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551

This bug was introduced in commit 9f24e4208f7e, so it's been around
since 2.6.30.  :-(

Fix this by using an atomic64_t for struct orlav_stats's
free_clusters.

Signed-off-by: "Theodore Ts'o" 
Reviewed-by: Lukas Czerner 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

ext4: convert number of blocks to clusters properly

2013-03-27T02:41:12+00:00

commit 810da240f221d64bf90020f25941b05b378186fe upstream.

We're using macro EXT4_B2C() to convert number of blocks to number of
clusters for bigalloc file systems.  However, we should be using
EXT4_NUM_B2C().

Signed-off-by: Lukas Czerner 
Signed-off-by: "Theodore Ts'o" 
[bwh: Backported to 3.2:
 - Adjust context
 - Drop changes in ext4_setup_new_descs() and ext4_calculate_overhead()]
Signed-off-by: Ben Hutchings

ext4: fix the wrong number of the allocated blocks in ext4_split_extent()

2013-03-27T02:41:11+00:00

commit 3a2256702e47f68f921dfad41b1764d05c572329 upstream.

This commit fixes a wrong return value of the number of the allocated
blocks in ext4_split_extent.  When the length of blocks we want to
allocate is greater than the length of the current extent, we return a
wrong number.  Let's see what happens in the following case when we
call ext4_split_extent().

  map: [48, 72]
  ex:  [32, 64, u]

'ex' will be split into two parts:
  ex1: [32, 47, u]
  ex2: [48, 64, w]

'map->m_len' is returned from this function, and the value is 24.  But
the real length is 16.  So it should be fixed.

Meanwhile in this commit we use right length of the allocated blocks
when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents
is called.

Signed-off-by: Zheng Liu 
Signed-off-by: "Theodore Ts'o" 
Cc: Dmitry Monakhov 
Signed-off-by: Ben Hutchings

ext4: fix kernel BUG on large-scale rm -rf commands

2013-03-06T03:24:30+00:00

commit 89a4e48f8479f8145eca9698f39fe188c982212f upstream.

Commit 968dee7722: "ext4: fix hole punch failure when depth is greater
than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel
crashes when users ran run "rm -rf" on large directory hierarchy on
ext4 filesystems on RAID devices:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028

    Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710)
    Call Trace:
     [] ? __ext4_handle_dirty_metadata+0x83/0x110
     [] ext4_ext_truncate+0x193/0x1d0
     [] ? ext4_mark_inode_dirty+0x7f/0x1f0
     [] ext4_truncate+0xf5/0x100
     [] ext4_evict_inode+0x461/0x490
     [] evict+0xa2/0x1a0
     [] iput+0x103/0x1f0
     [] do_unlinkat+0x154/0x1c0
     [] ? sys_newfstatat+0x2a/0x40
     [] sys_unlinkat+0x1b/0x50
     [] system_call_fastpath+0x16/0x1b
    Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00

    RIP  [] ext4_ext_remove_space+0xa34/0xdf0

This could be reproduced as follows:

The problem in commit 968dee7722 was that caused the variable 'i' to
be left uninitialized if the truncate required more space than was
available in the journal.  This resulted in the function
ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused
ext4_ext_remove_space() to restart the truncate operation after
starting a new jbd2 handle.

Reported-by: Maciej Żenczykowski 
Reported-by: Marti Raudsepp 
Tested-by: Fengguang Wu 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Ben Hutchings

ext4: fix hole punch failure when depth is greater than 0

2013-03-06T03:24:29+00:00

commit 968dee77220768a5f52cf8b21d0bdb73486febef upstream.

Whether to continue removing extents or not is decided by the return
value of function ext4_ext_more_to_rm() which checks 2 conditions:
a) if there are no more indexes to process.
b) if the number of entries are decreased in the header of "depth -1".

In case of hole punch, if the last block to be removed is not part of
the last extent index than this index will not be deleted, hence the
number of valid entries in the extent header of "depth - 1" will
remain as it is and ext4_ext_more_to_rm will return 0 although the
required blocks are not yet removed.

This patch fixes the above mentioned problem as instead of removing
the extents from the end of file, it starts removing the blocks from
the particular extent from which removing blocks is actually required
and continue backward until done.

Signed-off-by: Ashish Sangwan 
Signed-off-by: Namjae Jeon 
Reviewed-by: Lukas Czerner 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Ben Hutchings

ext4: rewrite punch hole to use ext4_ext_remove_space()

2013-03-06T03:24:29+00:00

commit 5f95d21fb6f2aaa52830e5b7fb405f6c71d3ab85 upstream.

This commit rewrites ext4 punch hole implementation to use
ext4_ext_remove_space() instead of its home gown way of doing this via
ext4_ext_map_blocks(). There are several reasons for changing this.

Firstly it is quite non obvious that punching hole needs to
ext4_ext_map_blocks() to punch a hole, especially given that this
function should map blocks, not unmap it. It also required a lot of new
code in ext4_ext_map_blocks().

Secondly the design of it is not very effective. The reason is that we
are trying to punch out blocks in ext4_ext_punch_hole() in opposite
direction than in ext4_ext_rm_leaf() which causes the ext4_ext_rm_leaf()
to iterate through the whole tree from the end to the start to find the
requested extent for every extent we are going to punch out.

And finally the current implementation does not use the existing code,
but bring a lot of new code, which is IMO unnecessary since there
already is some infrastructure we can use. Specifically
ext4_ext_remove_space().

This commit changes ext4_ext_remove_space() to accept 'end' parameter so
we can not only truncate to the end of file, but also remove the space
in the middle of the file (punch a hole). Moreover, because the last
block to punch out, might be in the middle of the extent, we have to
split the extent at 'end + 1' so ext4_ext_rm_leaf() can easily either
remove the whole fist part of split extent, or change its size.

ext4_ext_remove_space() is then used to actually remove the space
(extents) from within the hole, instead of ext4_ext_map_blocks().

Note that this also fix the issue with punch hole, where we would forget
to remove empty index blocks from the extent tree, resulting in double
free block error and file system corruption. This is simply because we
now use different code path, where this problem does not exist.

This has been tested with fsx running for several days and xfstests,
plus xfstest #251 with '-o discard' run on the loop image (which
converts discard requestes into punch hole to the backing file). All of
it on 1K and 4K file system block size.

Signed-off-by: Lukas Czerner 
Signed-off-by: "Theodore Ts'o" 
[bwh: Backported to 3.2.y: move EXT4_EXT_DATA_VALID{1,2} along with the
 other extent splitting flags]
Signed-off-by: Ben Hutchings

ext4: fix free clusters calculation in bigalloc filesystem

2013-03-06T03:24:10+00:00

commit 304e220f0879198b1f5309ad6f0be862b4009491 upstream.

ext4_has_free_clusters() should tell us whether there is enough free
clusters to allocate, however number of free clusters in the file system
is converted to blocks using EXT4_C2B() which is not only wrong use of
the macro (we should have used EXT4_NUM_B2C) but it's also completely
wrong concept since everything else is in cluster units.

Moreover when calculating number of root clusters we should be using
macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
off by one. However r_blocks_count should always be a multiple of the
cluster ratio so doing a plain bit shift should be enough here. We
avoid using EXT4_B2C() because it's confusing.

As a result of the first problem number of free clusters is much bigger
than it should have been and ext4_has_free_clusters() would return 1 even
if there is really not enough free clusters available.

Fix this by removing the EXT4_C2B() conversion of free clusters and
using bit shift when calculating number of root clusters. This bug
affects number of xfstests tests covering file system ENOSPC situation
handling. With this patch most of the ENOSPC problems with bigalloc file
system disappear, especially the errors caused by delayed allocation not
having enough space when the actual allocation is finally requested.

Signed-off-by: Lukas Czerner 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Ben Hutchings

ext4: fix xattr block allocation/release with bigalloc

2013-03-06T03:24:03+00:00

commit 1231b3a1eb5740192aeebf5344dd6d6da000febf upstream.

Currently when new xattr block is created or released we we would call
dquot_free_block() or dquot_alloc_block() respectively, among the else
decrementing or incrementing the number of blocks assigned to the
inode by one block.

This however does not work for bigalloc file system because we always
allocate/free the whole cluster so we have to count with that in
dquot_free_block() and dquot_alloc_block() as well.

Use the clusters-to-blocks conversion EXT4_C2B() when passing number of
blocks to the dquot_alloc/free functions to fix the problem.

The problem has been revealed by xfstests #117 (and possibly others).

Signed-off-by: Lukas Czerner 
Signed-off-by: "Theodore Ts'o" 
Reviewed-by: Eric Sandeen 
Signed-off-by: Ben Hutchings

ext4: fix race in ext4_mb_add_n_trim()

2013-03-06T03:23:47+00:00

commit f1167009711032b0d747ec89a632a626c901a1ad upstream.

In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when
changing the lg_prealloc_list.

Signed-off-by: Niu Yawei 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Ben Hutchings