linux-stable.git/fs/btrfs, branch linux-3.8.y

Btrfs: fix extent logging with O_DIRECT into prealloc

2013-05-11T20:54:10+00:00

commit eb384b55ae9c2055ea00c5cc87971e182d47aefa upstream.

This is the same as the fix from commit

Btrfs: fix bad extent logging

but for O_DIRECT.  I missed this when I fixed the problem originally, we were
still using the em for the orig_start and orig_block_len, which would be the
merged extent.  We need to use the actual extent from the on disk file extent
item, which we have to lookup to make sure it's ok to nocow anyway so just pass
in some pointers to hold this info.  Thanks,

Signed-off-by: Josef Bacik 
Signed-off-by: Greg Kroah-Hartman

Btrfs: compare relevant parts of delayed tree refs

2013-05-11T20:54:09+00:00

commit 41b0fc42800569f63e029549b75c4c9cb63f2dfd upstream.

A user reported a panic while running a balance.  What was happening was he was
relocating a block, which added the reference to the relocation tree.  Then
relocation would walk through the relocation tree and drop that reference and
free that block, and then it would walk down a snapshot which referenced the
same block and add another ref to the block.  The problem is this was all
happening in the same transaction, so the parent block was free'ed up when we
drop our reference which was immediately available for allocation, and then it
was used _again_ to add a reference for the same block from a different
snapshot.  This resulted in something like this in the delayed ref tree

add ref to 90234880, parent=2067398656, ref_root 1766, level 1
del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1
add ref to 90234880, parent=2067398656, ref_root 1767, level 1

as you can see the ref_root's don't match, because when we inc the ref we use
the header owner, which is the original tree the block belonged to, instead of
the data reloc tree.  Then when we remove the extent we use the reloc tree
objectid.  But none of this matters, since it is a shared reference which means
only the parent matters.  When the delayed ref stuff runs it adds all the
increments first, and then does all the drops, to make sure that we don't delete
the ref if we net a positive ref count.  But tree blocks aren't allowed to have
multiple refs from the same block, so this panics when it tries to add the
second ref.  We need the add and the drop to cancel each other out in memory so
we only do the final add.

So to fix this we need to adjust how the delayed refs are added to the tree.
Only the ref_root matters when it is a normal backref, and only the parent
matters when it is a shared backref.  So make our decision based on what ref
type we have.  This allows us to keep the ref_root in memory in case anybody
wants to use it for something else, and it allows the delayed refs to be merged
properly so we don't end up with this panic.

With this patch the users image no longer panics on mount, and it has a clean
fsck after a normal mount/umount cycle.  Thanks,

Reported-by: Roman Mamedov 
Signed-off-by: Josef Bacik 
Signed-off-by: Greg Kroah-Hartman

Btrfs: make sure nbytes are right after log replay

2013-04-25T19:51:24+00:00

commit 4bc4bee4595662d8bff92180d5c32e3313a704b0 upstream.

While trying to track down a tree log replay bug I noticed that fsck was always
complaining about nbytes not being right for our fsynced file.  That is because
the new fsync stuff doesn't wait for ordered extents to complete, so the inodes
nbytes are not necessarily updated properly when we log it.  So to fix this we
need to set nbytes to whatever it is on the inode that is on disk, so when we
replay the extents we can just add the bytes that are being added as we replay
the extent.  This makes it work for the case that we have the wrong nbytes or
the case that we logged everything and nbytes is actually correct.  With this
I'm no longer getting nbytes errors out of btrfsck.

Signed-off-by: Josef Bacik 
Signed-off-by: Chris Mason 
Signed-off-by: Lingzhu Xiang 
Reviewed-by: CAI Qian 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix space leak when we fail to reserve metadata space

2013-04-05T16:26:16+00:00

commit f4881bc7a83eff263789dd524b7c269d138d4af5 upstream.

Dave reported a warning when running xfstest 275.  We have been leaking delalloc
metadata space when our reservations fail.  This is because we were improperly
calculating how much space to free for our checksum reservations.  The problem
is we would sometimes free up space that had already been freed in another
thread and we would end up with negative usage for the delalloc space.  This
patch fixes the problem by calculating how much space the other threads would
have already freed, and then calculate how much space we need to free had we not
done the reservation at all, and then freeing any excess space.  This makes
xfstests 275 no longer have leaked space.  Thanks

Reported-by: David Sterba 
Signed-off-by: Josef Bacik 
Signed-off-by: Lingzhu Xiang 
Reviewed-by: CAI Qian 
Signed-off-by: Greg Kroah-Hartman

Btrfs: don't drop path when printing out tree errors in scrub

2013-04-05T16:26:02+00:00

commit d8fe29e9dea8d7d61fd140d8779326856478fc62 upstream.

A user reported a panic where we were panicing somewhere in
tree_backref_for_extent from scrub_print_warning.  He only captured the trace
but looking at scrub_print_warning we drop the path right before we mess with
the extent buffer to print out a bunch of stuff, which isn't right.  So fix this
by dropping the path after we use the eb if we need to.  Thanks,

Signed-off-by: Josef Bacik 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: limit the global reserve to 512mb

2013-04-05T16:26:02+00:00

commit fdf30d1c1b386e1b73116cc7e0fb14e962b763b0 upstream.

A user reported a problem where he was getting early ENOSPC with hundreds of
gigs of free data space and 6 gigs of free metadata space.  This is because the
global block reserve was taking up the entire free metadata space.  This is
ridiculous, we have infrastructure in place to throttle if we start using too
much of the global reserve, so instead of letting it get this huge just limit it
to 512mb so that users can still get work done.  This allowed the user to
complete his rsync without issues.  Thanks

Reported-and-tested-by: Stefan Priebe 
Signed-off-by: Josef Bacik 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix race between mmap writes and compression

2013-04-05T16:26:02+00:00

commit 4adaa611020fa6ac65b0ac8db78276af4ec04e63 upstream.

Btrfs uses page_mkwrite to ensure stable pages during
crc calculations and mmap workloads.  We call clear_page_dirty_for_io
before we do any crcs, and this forces any application with the file
mapped to wait for the crc to finish before it is allowed to change
the file.

With compression on, the clear_page_dirty_for_io step is happening after
we've compressed the pages.  This means the applications might be
changing the pages while we are compressing them, and some of those
modifications might not hit the disk.

This commit adds the clear_page_dirty_for_io before compression starts
and makes sure to redirty the page if we have to fallback to
uncompressed IO as well.

Signed-off-by: Chris Mason 
Reported-by: Alexandre Oliva 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix locking on ROOT_REPLACE operations in tree mod log

2013-04-05T16:26:02+00:00

commit d9abbf1c3131b679379762700201ae69367f3f62 upstream.

To resolve backrefs, ROOT_REPLACE operations in the tree mod log are
required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation.
Therefore, those operations must be enclosed by tree_mod_log_write_lock()
and tree_mod_log_write_unlock() calls.

Those calls are private to the tree_mod_log_* functions, which means that
removal of the elements of an old root node must be logged from
tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5
(Btrfs: fix a tree mod logging issue for root replacement operations).

This fixes the brand-new version of xfstest 276 as of commit cfe73f71.

Signed-off-by: Jan Schmidt 
Signed-off-by: Josef Bacik 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: use set_nlink if our i_nlink is 0

2013-04-05T16:26:02+00:00

commit 9bf7a4890518186238d2579be16ecc5190a707c0 upstream.

We need to inc the nlink of deleted entries when running replay so we can do the
unlink on the fs_root and get everything cleaned up and then have the orphan
cleanup do the right thing.  The problem is inc_nlink complains about this, even
thought it still does the right thing.  So use set_nlink() if our i_nlink is 0
to keep users from seeing the warnings during log replay.  Thanks,

Signed-off-by: Josef Bacik 
Signed-off-by: Greg Kroah-Hartman

btrfs: use rcu_barrier() to wait for bdev puts at unmount

2013-03-20T20:10:56+00:00

commit bc178622d40d87e75abc131007342429c9b03351 upstream.

Doing this would reliably fail with -EBUSY for me:

# mount /dev/sdb2 /mnt/scratch; umount /mnt/scratch; mkfs.btrfs -f /dev/sdb2
...
unable to open /dev/sdb2: Device or resource busy

because mkfs.btrfs tries to open the device O_EXCL, and somebody still has it.

Using systemtap to track bdev gets & puts shows a kworker thread doing a
blkdev put after mkfs attempts a get; this is left over from the unmount
path:

btrfs_close_devices
	__btrfs_close_devices
		call_rcu(&device->rcu, free_device);
			free_device
				INIT_WORK(&device->rcu_work, __free_device);
				schedule_work(&device->rcu_work);

so unmount might complete before __free_device fires & does its blkdev_put.

Adding an rcu_barrier() to btrfs_close_devices() causes unmount to wait
until all blkdev_put()s are done, and the device is truly free once
unmount completes.

Signed-off-by: Eric Sandeen 
Signed-off-by: Josef Bacik 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman