linux-stable.git/fs/ext4, branch linux-2.6.27.y

ext4: avoid hangs in ext4_da_should_update_i_disksize()

2012-02-11T14:38:16+00:00

commit ea51d132dbf9b00063169c1159bee253d9649224 upstream.

If the pte mapping in generic_perform_write() is unmapped between
iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the
"copied" parameter to ->end_write can be zero. ext4 couldn't cope with
it with delayed allocations enabled. This skips the i_disksize
enlargement logic if copied is zero and no new data was appeneded to
the inode.

 gdb> bt
 #0  0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x1\
 08000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
 #1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
 xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
 #2  0xffffffff810d97f1 in generic_perform_write (iocb=, iov=, nr_segs=, pos=0x108000, ppos=0xffff88001e26be40, count=, written=0x0) at mm/filemap.c:2440
 #3  generic_file_buffered_write (iocb=, iov=, nr_segs=, p\
 os=0x108000, ppos=0xffff88001e26be40, count=, written=0x0) at mm/filemap.c:2482
 #4  0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0\
 xffff88001e26be40) at mm/filemap.c:2600
 #5  0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=, pos=) at mm/filemap.c:2632
 #6  0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) a\
 t fs/ext4/file.c:136
 #7  0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=, len=, \
 ppos=0xffff88001e26bf48) at fs/read_write.c:406
 #8  0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 , count=0x4\
 000, pos=0xffff88001e26bf48) at fs/read_write.c:435
 #9  0xffffffff8113816c in sys_write (fd=, buf=0x1ec2960 , count=0x\
 4000) at fs/read_write.c:487
 #10 
 #11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
 #12 0x0000000000000000 in ?? ()
 gdb> print offset
 $22 = 0xffffffffffffffff
 gdb> print idx
 $23 = 0xffffffff
 gdb> print inode->i_blkbits
 $24 = 0xc
 gdb> up
 #1  ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0\
 xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
 2512                    if (ext4_da_should_update_i_disksize(page, end)) {
 gdb> print start
 $25 = 0x0
 gdb> print end
 $26 = 0xffffffffffffffff
 gdb> print pos
 $27 = 0x108000
 gdb> print new_i_size
 $28 = 0x108000
 gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
 $29 = 0xd9000
 gdb> down
 2467            for (i = 0; i < idx; i++)
 gdb> print i
 $30 = 0xd44acbee

This is 100% reproducible with some autonuma development code tuned in
a very aggressive manner (not normal way even for knumad) which does
"exotic" changes to the ptes. It wouldn't normally trigger but I don't
see why it can't happen normally if the page is added to swap cache in
between the two faults leading to "copied" being zero (which then
hangs in ext4). So it should be fixed. Especially possible with lumpy
reclaim (albeit disabled if compaction is enabled) as that would
ignore the young bits in the ptes.

Signed-off-by: Andrea Arcangeli 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Willy Tarreau

ext4: fix BUG_ON() in ext4_ext_insert_extent()

2012-02-11T14:37:53+00:00

Does not corrispond with a direct commit in Linus's tree as it was fixed
differently in the 3.0 release.

We will meet with a BUG_ON() if following script is run.

mkfs.ext4 -b 4096 /dev/sdb1 1000000
mount -t ext4 /dev/sdb1 /mnt/sdb1
fallocate -l 100M /mnt/sdb1/test
sync
for((i=0;i<170;i++))
do
        dd if=/dev/zero of=/mnt/sdb1/test conv=notrunc bs=256k count=1
seek=`expr $i \* 2`
done
umount /mnt/sdb1
mount -t ext4 /dev/sdb1 /mnt/sdb1
dd if=/dev/zero of=/mnt/sdb1/test conv=notrunc bs=256k count=1 seek=341
umount /mnt/sdb1
mount /dev/sdb1 /mnt/sdb1
dd if=/dev/zero of=/mnt/sdb1/test conv=notrunc bs=256k count=1 seek=340
sync

The reason is that it forgot to mark dirty when splitting two extents in
ext4_ext_convert_to_initialized(). Althrough ex has been updated in
memory, it is not dirtied both in ext4_ext_convert_to_initialized() and
ext4_ext_insert_extent(). The disk layout is corrupted. Then it will
meet with a BUG_ON() when writting at the start of that extent again.

Cc: "Theodore Ts'o" 
Cc: Xiaoyun Mao 
Cc: Yingbin Wang 
Cc: Jia Wan 
Signed-off-by: Zheng Liu 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Willy Tarreau

ext4: fix credits computing for indirect mapped files

2011-04-30T14:53:34+00:00

commit 5b41395fcc0265fc9f193aef9df39ce49d64677c upstream.

When writing a contiguous set of blocks, two indirect blocks could be
needed depending on how the blocks are aligned, so we need to increase
the number of credits needed by one.

[ Also fixed a another bug which could further underestimate the
  number of journal credits needed by 1; the code was using integer
  division instead of DIV_ROUND_UP() -- tytso]

Signed-off-by: Yongqiang Yang 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: Implement range_cyclic in ext4_da_writepages instead of write_cache_pages

2010-07-05T18:08:46+00:00

commit 2acf2c261b823d9d9ed954f348b97620297a36b5 upstream.

With delayed allocation we lock the page in write_cache_pages() and
try to build an in memory extent of contiguous blocks.  This is needed
so that we can get large contiguous blocks request.  If range_cyclic
mode is enabled, write_cache_pages() will loop back to the 0 index if
no I/O has been done yet, and try to start writing from the beginning
of the range.  That causes an attempt to take the page lock of lower
index page while holding the page lock of higher index page, which can
cause a dead lock with another writeback thread.

The solution is to implement the range_cyclic behavior in
ext4_da_writepages() instead.

http://bugzilla.kernel.org/show_bug.cgi?id=12579

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Jayson R. King 
Signed-off-by: Greg Kroah-Hartman

ext4: Fix file fragmentation during large file write.

2010-07-05T18:08:45+00:00

commit 22208dedbd7626e5fc4339c417f8d24cc21f79d7 upstream.

The range_cyclic writeback mode uses the address_space writeback_index
as the start index for writeback.  With delayed allocation we were
updating writeback_index wrongly resulting in highly fragmented file.
This patch reduces the number of extents reduced from 4000 to 27 for a
3GB file.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Theodore Ts'o 
[dev@jaysonking.com: Some changed lines from the original version of this patch were dropped, since they were rolled up with another cherry-picked patch applied to 2.6.27.y earlier.]
[dev@jaysonking.com: Use of wbc->no_nrwrite_index_update was dropped, since write_cache_pages_da() implies it.]
Signed-off-by: Jayson R. King 
Signed-off-by: Greg Kroah-Hartman

ext4: Use our own write_cache_pages()

2010-07-05T18:08:45+00:00

commit 8e48dcfbd7c0892b4cfd064d682cc4c95a29df32 upstream.

Make a copy of write_cache_pages() for the benefit of
ext4_da_writepages().  This allows us to simplify the code some, and
will allow us to further customize the code in future patches.

There are some nasty hacks in write_cache_pages(), which Linus has
(correctly) characterized as vile.  I've just copied it into
write_cache_pages_da(), without trying to clean those bits up lest I
break something in the ext4's delalloc implementation, which is a bit
fragile right now.  This will allow Dave Chinner to clean up
write_cache_pages() in mm/page-writeback.c, without worrying about
breaking ext4.  Eventually write_cache_pages_da() will go away when I
rewrite ext4's delayed allocation and create a general
ext4_writepages() which is used for all of ext4's writeback.  Until
now this is the lowest risk way to clean up the core
write_cache_pages() function.

Signed-off-by: "Theodore Ts'o" 
Cc: Dave Chinner 
[dev@jaysonking.com: Dropped the hunks which reverted the use of no_nrwrite_index_update, since those lines weren't ever created on 2.6.27.y]
[dev@jaysonking.com: Copied from 2.6.27.y's version of write_cache_pages(), plus the changes to it from patch "vfs: Add no_nrwrite_index_update writeback control flag"]
Signed-off-by: Jayson R. King 
Signed-off-by: Greg Kroah-Hartman

ext4: check s_log_groups_per_flex in online resize code

2010-07-05T18:08:45+00:00

commit 42007efd569f1cf3bfb9a61da60ef6c2179508ca upstream.

If groups_per_flex < 2, sbi->s_flex_groups[] doesn't get filled out,
and every other access to this first tests s_log_groups_per_flex;
same thing needs to happen in resize or we'll wander off into
a null pointer when doing an online resize of the file system.

Thanks to Christoph Biedl, who came up with the trivial testcase:

# truncate --size 128M fsfile
# mkfs.ext3 -F fsfile
# tune2fs -O extents,uninit_bg,dir_index,flex_bg,huge_file,dir_nlink,extra_isize fsfile
# e2fsck -yDf -C0 fsfile
# truncate --size 132M fsfile
# losetup /dev/loop0 fsfile
# mount /dev/loop0 mnt
# resize2fs -p /dev/loop0

	https://bugzilla.kernel.org/show_bug.cgi?id=13549

Reported-by: Alessandro Polverini 
Test-case-by: Christoph Biedl  
Signed-off-by: Eric Sandeen 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Greg Kroah-Hartman

ext4: Use tag dirty lookup during mpage_da_submit_io

2010-05-26T21:27:06+00:00

commit af6f029d3836eb7264cd3fbb13a6baf0e5fdb5ea upstream.

This enables us to drop the range_cont writeback mode
use from ext4_da_writepages.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Jayson R. King 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: Retry block allocation if we have free blocks left

2010-05-26T21:27:06+00:00

commit df22291ff0fde0d350cf15dac3e5cc33ac528875 upstream.

When we truncate files, the meta-data blocks released are not reused
untill we commit the truncate transaction.  That means delayed get_block
request will return ENOSPC even if we have free blocks left.  Force a
journal commit and retry block allocation if we get ENOSPC with free
blocks left.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Mingming Cao 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Jayson R. King 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman

ext4: Retry block reservation

2010-05-26T21:27:06+00:00

commit 030ba6bc67b4f2bc5cd174f57785a1745c929abe upstream.

During block reservation if we don't have enough blocks left, retry
block reservation with smaller block counts.  This makes sure we try
fallocate and DIO with smaller request size and don't fail early.  The
delayed allocation reservation cannot try with smaller block count. So
retry block reservation to handle temporary disk full conditions.  Also
print free blocks details if we fail block allocation during writepages.

Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: Mingming Cao 
Signed-off-by: "Theodore Ts'o" 
Signed-off-by: Jayson R. King 
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman