linux-stable.git/fs/btrfs, branch linux-3.19.y

Btrfs: fix inode eviction infinite loop after extent_same ioctl

2015-05-06T20:01:41+00:00

commit 113e8283869b9855c8b999796aadd506bbac155f upstream.

If we pass a length of 0 to the extent_same ioctl, we end up locking an
extent range with a start offset greater then its end offset (if the
destination file's offset is greater than zero). This results in a warning
from extent_io.c:insert_state through the following call chain:

  btrfs_extent_same()
    btrfs_double_lock()
      lock_extent_range()
        lock_extent(inode->io_tree, offset, offset + len - 1)
          lock_extent_bits()
            __set_extent_bit()
              insert_state()
                --> WARN_ON(end < start)

This leads to an infinite loop when evicting the inode. This is the same
problem that my previous patch titled
"Btrfs: fix inode eviction infinite loop after cloning into it" addressed
but for the extent_same ioctl instead of the clone ioctl.

Signed-off-by: Filipe Manana 
Reviewed-by: Omar Sandoval 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix inode eviction infinite loop after cloning into it

2015-05-06T20:01:40+00:00

commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329 upstream.

If we attempt to clone a 0 length region into a file we can end up
inserting a range in the inode's extent_io tree with a start offset
that is greater then the end offset, which triggers immediately the
following warning:

[ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3914.620886] BTRFS: end < start 4095 4096
(...)
[ 3914.638093] Call Trace:
[ 3914.638636]  [] dump_stack+0x4c/0x65
[ 3914.639620]  [] warn_slowpath_common+0xa1/0xbb
[ 3914.640789]  [] ? insert_state+0x4b/0x10b [btrfs]
[ 3914.642041]  [] warn_slowpath_fmt+0x46/0x48
[ 3914.643236]  [] insert_state+0x4b/0x10b [btrfs]
[ 3914.644441]  [] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3914.645711]  [] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3914.646914]  [] ? _raw_spin_unlock+0x28/0x33
[ 3914.648058]  [] ? test_range_bit+0xcc/0xde [btrfs]
[ 3914.650105]  [] lock_extent+0x13/0x15 [btrfs]
[ 3914.651361]  [] lock_extent_range+0x3d/0xcd [btrfs]
[ 3914.652761]  [] btrfs_ioctl_clone+0x278/0x388 [btrfs]
[ 3914.654128]  [] ? might_fault+0x58/0xb5
[ 3914.655320]  [] btrfs_ioctl+0xb51/0x2195 [btrfs]
(...)
[ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---

This later makes the inode eviction handler enter an infinite loop that
keeps dumping the following warning over and over:

[ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3915.119913] BTRFS: end < start 4095 4096
(...)
[ 3915.137394] Call Trace:
[ 3915.137913]  [] dump_stack+0x4c/0x65
[ 3915.139154]  [] warn_slowpath_common+0xa1/0xbb
[ 3915.140316]  [] ? insert_state+0x4b/0x10b [btrfs]
[ 3915.141505]  [] warn_slowpath_fmt+0x46/0x48
[ 3915.142709]  [] insert_state+0x4b/0x10b [btrfs]
[ 3915.143849]  [] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3915.145120]  [] ? btrfs_kill_super+0x17/0x23 [btrfs]
[ 3915.146352]  [] ? deactivate_locked_super+0x3b/0x50
[ 3915.147565]  [] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3915.148785]  [] ? _raw_write_unlock+0x28/0x33
[ 3915.149931]  [] btrfs_evict_inode+0x196/0x482 [btrfs]
[ 3915.151154]  [] evict+0xa0/0x148
[ 3915.152094]  [] dispose_list+0x39/0x43
[ 3915.153081]  [] evict_inodes+0xdc/0xeb
[ 3915.154062]  [] generic_shutdown_super+0x49/0xef
[ 3915.155193]  [] kill_anon_super+0x13/0x1e
[ 3915.156274]  [] btrfs_kill_super+0x17/0x23 [btrfs]
(...)
[ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---

So just bail out of the clone ioctl if the length of the region to clone
is zero, without locking any extent range, in order to prevent this issue
(same behaviour as a pwrite with a 0 length for example).

This is trivial to reproduce. For example, the steps for the test I just
made for fstests:

  mkfs.btrfs -f SCRATCH_DEV
  mount SCRATCH_DEV $SCRATCH_MNT

  touch $SCRATCH_MNT/foo
  touch $SCRATCH_MNT/bar

  $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
  umount $SCRATCH_MNT

A test case for fstests follows soon.

Signed-off-by: Filipe Manana 
Reviewed-by: Omar Sandoval 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

btrfs: don't accept bare namespace as a valid xattr

2015-05-06T20:01:40+00:00

commit 3c3b04d10ff1811a27f86684ccd2f5ba6983211d upstream.

Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
works:

 $ touch file
 $ setfattr -n user. -v 1 file
 $ getfattr -d file
user.="1"

ie. the missing attribute name after the namespace.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291
Reported-by: William Douglas 
Signed-off-by: David Sterba 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix log tree corruption when fs mounted with -o discard

2015-05-06T20:01:40+00:00

commit dcc82f4783ad91d4ab654f89f37ae9291cdc846a upstream.

While committing a transaction we free the log roots before we write the
new super block. Freeing the log roots implies marking the disk location
of every node/leaf (metadata extent) as pinned before the new super block
is written. This is to prevent the disk location of log metadata extents
from being reused before the new super block is written, otherwise we
would have a corrupted log tree if before the new super block is written
a crash/reboot happens and the location of any log tree metadata extent
ended up being reused and rewritten.

Even though we pinned the log tree's metadata extents, we were issuing a
discard against them if the fs was mounted with the -o discard option,
resulting in corruption of the log tree if a crash/reboot happened before
writing the new super block - the next time the fs was mounted, during
the log replay process we would find nodes/leafs of the log btree with
a content full of zeroes, causing the process to fail and require the
use of the tool btrfs-zero-log to wipeout the log tree (and all data
previously fsynced becoming lost forever).

Fix this by not doing a discard when pinning an extent. The discard will
be done later when it's safe (after the new super block is committed) at
extent-tree.c:btrfs_finish_extent_commit().

Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log)
Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

btrfs: simplify insert_orphan_item

2015-04-19T08:10:15+00:00

commit 9c4f61f01d269815bb7c37be3ede59c5587747c6 upstream.

We can search and add the orphan item in one go,
btrfs_insert_orphan_item will find out if the item already exists.

Signed-off-by: David Sterba 
Cc: Chris Mason 
Cc: Roman Mamedov 
Signed-off-by: Greg Kroah-Hartman

Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.

2015-03-18T13:10:59+00:00

commit dd9ef135e3542ffc621c4eb7f0091870ec7a1504 upstream.

Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.

Signed-off-by: Quentin Casasnovas 
Reviewed-by: David Sterba 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix data loss in the fast fsync path

2015-03-18T13:10:59+00:00

commit 3a8b36f378060d20062a0918e99fae39ff077bf0 upstream.

When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longers sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

btrfs: fix lost return value due to variable shadowing

2015-03-18T13:10:59+00:00

commit 1932b7be973b554ffe20a5bba6ffaed6fa995cdc upstream.

A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.

Fixes: d187663ef24c ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix fsync race leading to ordered extent memory leaks

2015-03-18T13:10:59+00:00

commit 4d884fceaa2c838abb598778813e93f6d9fea723 upstream.

We can have multiple fsync operations against the same file during the
same transaction and they can collect the same ordered extents while they
don't complete (still accessible from the inode's ordered tree). If this
happens, those ordered extents will never get their reference counts
decremented to 0, leading to memory leaks and inode leaks (an iput for an
ordered extent's inode is scheduled only when the ordered extent's refcount
drops to 0). The following sequence diagram explains this race:

         CPU 1                                         CPU 2

btrfs_sync_file()

                                                 btrfs_sync_file()

  mutex_lock(inode->i_mutex)
  btrfs_log_inode()
    btrfs_get_logged_extents()
      --> collects ordered extent X
      --> increments ordered
          extent X's refcount
    btrfs_submit_logged_extents()
  mutex_unlock(inode->i_mutex)

                                                   mutex_lock(inode->i_mutex)
  btrfs_sync_log()
     btrfs_wait_logged_extents()
       --> list_del_init(&ordered->log_list)
                                                     btrfs_log_inode()
                                                       btrfs_get_logged_extents()
                                                         --> Adds ordered extent X
                                                             to logged_list because
                                                             at this point:
                                                             list_empty(&ordered->log_list)
                                                             && test_bit(BTRFS_ORDERED_LOGGED,
                                                                         &ordered->flags) == 0
                                                         --> Increments ordered extent
                                                             X's refcount
       --> check if ordered extent's io is
           finished or not, start it if
           necessary and wait for it to finish
       --> sets bit BTRFS_ORDERED_LOGGED
           on ordered extent X's flags
           and adds it to trans->ordered
  btrfs_sync_log() finishes

                                                       btrfs_submit_logged_extents()
                                                     btrfs_log_inode() finishes
                                                   mutex_unlock(inode->i_mutex)

btrfs_sync_file() finishes

                                                   btrfs_sync_log()
                                                      btrfs_wait_logged_extents()
                                                        --> Sees ordered extent X has the
                                                            bit BTRFS_ORDERED_LOGGED set in
                                                            its flags
                                                        --> X's refcount is untouched
                                                   btrfs_sync_log() finishes

                                                 btrfs_sync_file() finishes

btrfs_commit_transaction()
  --> called by transaction kthread for e.g.
  btrfs_wait_pending_ordered()
    --> waits for ordered extent X to
        complete
    --> decrements ordered extent X's
        refcount by 1 only, corresponding
        to the increment done by the fsync
        task ran by CPU 1

In the scenario of the above diagram, after the transaction commit,
the ordered extent will remain with a refcount of 1 forever, leaking
the ordered extent structure and preventing the i_count of its inode
from ever decreasing to 0, since the delayed iput is scheduled only
when the ordered extent's refcount drops to 0, preventing the inode
from ever being evicted by the VFS.

Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
mean that an ordered extent is already being processed by an fsync call,
which will attach it to the current transaction, preventing it from being
collected by subsequent fsync operations against the same inode.

This race was introduced with the following change (added in 3.19 and
backported to stable 3.18 and 3.17):

  Btrfs: make sure logged extents complete in the current transaction V3
  commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f

I ran into this issue while running xfstests/generic/113 in a loop, which
failed about 1 out of 10 runs with the following warning in dmesg:

[ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
[ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
ppy e1000 scsi_mod [last unloaded: btrfs]
[ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2612.457709]  0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
[ 2612.459829]  0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
[ 2612.461564]  ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
[ 2612.463163] Call Trace:
[ 2612.463719]  [] dump_stack+0x4c/0x65
[ 2612.464789]  [] warn_slowpath_common+0xa1/0xbb
[ 2612.466026]  [] ? free_fs_root+0x36/0x133 [btrfs]
[ 2612.467247]  [] warn_slowpath_null+0x1a/0x1c
[ 2612.468416]  [] free_fs_root+0x36/0x133 [btrfs]
[ 2612.469625]  [] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
[ 2612.471251]  [] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
[ 2612.472536]  [] ? wait_for_completion+0x24/0x26
[ 2612.473742]  [] close_ctree+0x1f3/0x33c [btrfs]
[ 2612.475477]  [] ? destroy_workqueue+0x148/0x1ba
[ 2612.476695]  [] btrfs_put_super+0x19/0x1b [btrfs]
[ 2612.477911]  [] generic_shutdown_super+0x73/0xef
[ 2612.479106]  [] kill_anon_super+0x13/0x1e
[ 2612.480226]  [] btrfs_kill_super+0x17/0x23 [btrfs]
[ 2612.481471]  [] deactivate_locked_super+0x3b/0x50
[ 2612.482686]  [] deactivate_super+0x3f/0x43
[ 2612.483791]  [] cleanup_mnt+0x59/0x78
[ 2612.484842]  [] __cleanup_mnt+0x12/0x14
[ 2612.485900]  [] task_work_run+0x8f/0xbc
[ 2612.486960]  [] do_notify_resume+0x5a/0x6b
[ 2612.488083]  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 2612.489333]  [] int_signal+0x12/0x17
[ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
[ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.  Have a nice day...

Kmemleak confirmed the ordered extent leak (and btrfs inode specific
structures such as delayed nodes):

$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff880154290db0 (size 576):
  comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
  hex dump (first 32 bytes):
    01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff  .@.........N....
    00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff  ..........)T....
  backtrace:
    [] kmemleak_update_trace+0x4c/0x6a
    [] radix_tree_node_alloc+0x6d/0x83
    [] __radix_tree_create+0x109/0x190
    [] radix_tree_insert+0x30/0xac
    [] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [] btrfs_unlink+0x60/0x9b [btrfs]
    [] vfs_unlink+0x9c/0xed
    [] do_unlinkat+0x12c/0x1fa
    [] SyS_unlinkat+0x29/0x2b
    [] system_call_fastpath+0x12/0x17
    [] 0xffffffffffffffff
unreferenced object 0xffff88014ef11db0 (size 576):
  comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
  hex dump (first 32 bytes):
    02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff  ...........N....
  backtrace:
    [] kmemleak_update_trace+0x4c/0x6a
    [] radix_tree_node_alloc+0x6d/0x83
    [] __radix_tree_create+0x109/0x190
    [] radix_tree_insert+0x30/0xac
    [] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [] btrfs_unlink+0x60/0x9b [btrfs]
    [] vfs_unlink+0x9c/0xed
    [] do_unlinkat+0x12c/0x1fa
    [] SyS_unlinkat+0x29/0x2b
    [] system_call_fastpath+0x12/0x17
    [] 0xffffffffffffffff
unreferenced object 0xffff8800336feda8 (size 584):
  comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
  hex dump (first 32 bytes):
    00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00  .@>........B....
    00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
  backtrace:
    [] create_object+0x172/0x29a
    [] kmemleak_alloc+0x25/0x41
    [] kmemleak_alloc_recursive.constprop.52+0x16/0x18
    [] kmem_cache_alloc+0xf7/0x198
    [] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
    [] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
    [] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
    [] do_blockdev_direct_IO+0x62a/0xb47
    [] __blockdev_direct_IO+0x34/0x36
    [] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
    [] generic_file_direct_write+0xb8/0x12d
    [] btrfs_file_write_iter+0x24b/0x42f [btrfs]
    [] aio_run_iocb+0x2b7/0x32e
    [] do_io_submit+0x26e/0x2ff
    [] SyS_io_submit+0x10/0x12
    [] system_call_fastpath+0x12/0x17

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman

Btrfs: fix fsync data loss after adding hard link to inode

2015-03-06T22:57:42+00:00

commit 1a4bcf470c886b955adf36486f4c86f2441d85cb upstream.

We have a scenario where after the fsync log replay we can lose file data
that had been previously fsync'ed if we added an hard link for our inode
and after that we sync'ed the fsync log (for example by fsync'ing some
other file or directory).

This is because when adding an hard link we updated the inode item in the
log tree with an i_size value of 0. At that point the new inode item was
in memory only and a subsequent fsync log replay would not make us lose
the file data. However if after adding the hard link we sync the log tree
to disk, by fsync'ing some other file or directory for example, we ended
up losing the file data after log replay, because the inode item in the
persisted log tree had an an i_size of zero.

This is easy to reproduce, and the following excerpt from my test for
xfstests shows this:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create one file with data and fsync it.
  # This made the btrfs fsync log persist the data and the inode metadata with
  # a correct inode->i_size (4096 bytes).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 4K 0 4K" -c "fsync" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Now add one hard link to our file. This made the btrfs code update the fsync
  # log, in memory only, with an inode metadata having a size of 0.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now force persistence of the fsync log to disk, for example, by fsyncing some
  # other file.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # Before a power loss or crash, we could read the 4Kb of data from our file as
  # expected.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After the fsync log replay, because the fsync log had a value of 0 for our
  # inode's i_size, we couldn't read anymore the 4Kb of data that we previously
  # wrote and fsync'ed. The size of the file became 0 after the fsync log replay.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

Another alternative test, that doesn't need to fsync an inode in the same
transaction it was created, is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with some data.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa -b 8K 0 8K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Make sure the file is durably persisted.
  sync

  # Append some data to our file, to increase its size.
  $XFS_IO_PROG -f -c "pwrite -S 0xcc -b 4K 8K 4K" \
       $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, so from this point on if a crash/power failure happens, our
  # new data is guaranteed to be there next time the fs is mounted.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Add one hard link to our file. This made btrfs write into the in memory fsync
  # log a special inode with generation 0 and an i_size of 0 too. Note that this
  # didn't update the inode in the fsync log on disk.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link

  # Now make sure the in memory fsync log is durably persisted.
  # Creating and fsync'ing another file will do it.
  touch $SCRATCH_MNT/bar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar

  # As expected, before the crash/power failure, we should be able to read the
  # 12Kb of file data.
  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # After mounting the fs again, the fsync log was replayed.
  # The btrfs fsync log replay code didn't update the i_size of the persisted
  # inode because the inode item in the log had a special generation with a
  # value of 0 (and it couldn't know the correct i_size, since that inode item
  # had a 0 i_size too). This made the last 4Kb of file data inaccessible and
  # effectively lost.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This isn't a new issue/regression. This problem has been around since the
log tree code was added in 2008:

  Btrfs: Add a write ahead tree log to optimize synchronous operations
  (commit e02119d5a7b4396c5a872582fddc8bd6d305a70a)

Test cases for xfstests follow soon.

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason 
Signed-off-by: Greg Kroah-Hartman