linux.git/fs/btrfs/ordered-data.c, branch v4.6

btrfs: Fix misspellings in comments.

2016-03-14T14:05:02+00:00

Signed-off-by: Adam Buchbinder 
Signed-off-by: David Sterba

btrfs: move btrfs_compression_type to compression.h

2016-03-11T16:12:46+00:00

So that it's better organized.

Signed-off-by: Anand Jain 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: drop null testing before destroy functions

2016-02-18T10:46:03+00:00

Cleanup.

kmem_cache_destroy() already checks for a NULL argument,
so drop the redundant NULL test before calling it.
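
A minimal user-space sketch of the pattern, not the kernel code itself.
struct cache, cache_create() and cache_destroy() are hypothetical stand-ins
for struct kmem_cache and its helpers; the point is only that a destroy
function which tolerates NULL makes the caller's own check redundant.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kmem_cache. */
struct cache {
	size_t object_size;
};

/* Counts real teardowns so the behaviour is observable. */
static int caches_destroyed;

static struct cache *cache_create(size_t object_size)
{
	struct cache *c;

	assert(object_size > 0);
	c = malloc(sizeof(*c));
	if (c)
		c->object_size = object_size;
	return c;
}

/* Like kmem_cache_destroy(), a NULL argument is a no-op. */
static void cache_destroy(struct cache *c)
{
	if (!c)
		return;
	free(c);
	caches_destroyed++;
}
```

With this shape, callers can write cache_destroy(c) unconditionally instead
of wrapping it in "if (c)".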

Signed-off-by: Kinglong Mee 
Signed-off-by: David Sterba

Btrfs: change how we wait for pending ordered extents

2015-10-22T01:51:40+00:00

We have a mechanism to make sure we don't lose updates for ordered extents that
were logged in the transaction that is currently running.  We add the ordered
extent to a transaction list and then the transaction waits on all the ordered
extents in that list.  However, on substantially large file systems this list
can be extremely large, and can give us soft lockups, since the ordered extents
don't remove themselves from the list when they do complete.

To fix this we simply add a counter to the transaction that is incremented any
time we have a logged extent that needs to be completed in the current
transaction.  Then when the ordered extent finally completes it decrements the
per-transaction counter and wakes up the transaction if it was the last one.
This will eliminate the softlockup.  Thanks,
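
The counter scheme described above can be sketched in user space roughly as
follows. This is an illustration under assumptions, not the patch: struct txn
and the txn_*() helpers are hypothetical names, C11 atomics stand in for the
kernel's atomic_t, and the wakeups field stands in for a wake_up() call on
the transaction's waitqueue.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-transaction state. */
struct txn {
	atomic_int pending_ordered; /* logged extents not yet completed */
	int wakeups;                /* stands in for wake_up() on a waitqueue */
};

/* Called when a logged ordered extent must finish in this transaction. */
static void txn_add_pending(struct txn *t)
{
	atomic_fetch_add(&t->pending_ordered, 1);
}

/* Called when the extent completes; returns 1 if it was the last one. */
static int txn_complete_pending(struct txn *t)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	int prev = atomic_fetch_sub(&t->pending_ordered, 1);

	assert(prev > 0);
	if (prev == 1) {
		t->wakeups++;
		return 1;
	}
	return 0;
}

/* The commit path only checks a counter instead of walking a list. */
static int txn_has_pending(struct txn *t)
{
	return atomic_load(&t->pending_ordered) != 0;
}
```

The waiter's cost no longer depends on how many ordered extents were logged,
which is what removes the long list walk.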

Signed-off-by: Josef Bacik 
Signed-off-by: Chris Mason

btrfs: add comments to barriers before waitqueue_active

2015-10-10T16:40:04+00:00

Reduce the number of undocumented barriers out there.

Signed-off-by: David Sterba

Btrfs: fix memory corruption on failure to submit bio for direct IO

2015-07-02T00:17:18+00:00

If we fail to submit a bio for a direct IO request, we were grabbing the
corresponding ordered extent and decrementing its reference count twice,
once for our lookup reference and once for the ordered tree reference.
This was a problem because it caused the ordered extent to be freed
without removing it from the ordered tree and any lists it might be
attached to, leaving dangling pointers to the ordered extent around.
Example trace with CONFIG_DEBUG_PAGEALLOC=y:

[161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
[161779.859983] IP: [] rb_prev+0x22/0x3b
[161779.860636] PGD 34d818067 PUD 0
[161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
(...)
[161779.860636] Call Trace:
[161779.860636]  [] __tree_search+0xd9/0xf9 [btrfs]
[161779.860636]  [] tree_search+0x42/0x63 [btrfs]
[161779.860636]  [] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
[161779.860636]  [] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
[161779.860636]  [] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
[161779.860636]  [] do_blockdev_direct_IO+0x5ff/0xb43
[161779.860636]  [] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
[161779.860636]  [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [] __blockdev_direct_IO+0x32/0x34
[161779.860636]  [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [] btrfs_direct_IO+0x198/0x21f [btrfs]
[161779.860636]  [] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
[161779.860636]  [] generic_file_direct_write+0xb3/0x128
[161779.860636]  [] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
[161779.860636]  [] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
(...)

We were also not freeing the btrfs_dio_private we allocated previously,
which kmemleak reported with the following trace in its sysfs file:

unreferenced object 0xffff8803f553bf80 (size 96):
  comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
  hex dump (first 32 bytes):
    88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
    00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
  backtrace:
    [] create_object+0x172/0x29a
    [] kmemleak_alloc+0x25/0x41
    [] kmemleak_alloc_recursive.constprop.40+0x16/0x18
    [] kmem_cache_alloc_trace+0xfb/0x148
    [] btrfs_submit_direct+0x65/0x16a [btrfs]
    [] dio_bio_submit+0x62/0x8f
    [] do_blockdev_direct_IO+0x97e/0xb43
    [] __blockdev_direct_IO+0x32/0x34
    [] btrfs_direct_IO+0x198/0x21f [btrfs]
    [] generic_file_direct_write+0xb3/0x128
    [] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
    [] __vfs_write+0x7c/0xa5
    [] vfs_write+0xa0/0xe4
    [] SyS_pwrite64+0x64/0x82
    [] system_call_fastpath+0x12/0x6f
    [] 0xffffffffffffffff

For read requests we weren't doing any cleanup either (none of the work
done by btrfs_endio_direct_read()), so a failure submitting a bio for a
read request would leave a range in the inode's io_tree locked forever,
blocking any future operations (both reads and writes) against that range.

So fix this by making sure we do the same cleanup that we do for the case
where the bio submission succeeds.
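
The reference-counting rule behind the first problem can be sketched as
below. This is a simplified user-space model, not the btrfs code: struct
ordered and the ordered_*() helpers are hypothetical, and the assertion only
illustrates the invariant that an object must be unlinked from the tree
before its last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for an ordered extent. */
struct ordered {
	int refs;    /* one reference held by the tree, plus one per lookup */
	int in_tree; /* nonzero while the tree still points at this object */
};

static int ordered_freed; /* counts frees so the result is observable */

static struct ordered *ordered_create(void)
{
	struct ordered *o = malloc(sizeof(*o));

	if (o) {
		o->refs = 1; /* the tree's reference */
		o->in_tree = 1;
	}
	return o;
}

/* A lookup hands out an extra reference to the caller. */
static struct ordered *ordered_lookup(struct ordered *o)
{
	o->refs++;
	return o;
}

static void ordered_put(struct ordered *o)
{
	if (--o->refs == 0) {
		/* Freeing while still linked would leave a dangling pointer. */
		assert(!o->in_tree);
		free(o);
		ordered_freed++;
	}
}

/* Unlink first, then drop the reference the tree was holding. */
static void ordered_remove(struct ordered *o)
{
	o->in_tree = 0;
	ordered_put(o);
}
```

In this model, calling ordered_put() twice after a lookup without going
through ordered_remove() would trip the assertion, which mirrors the
dangling-pointer situation in the trace above.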

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason

Btrfs: don't attach unnecessary extents to transaction on fsync

2015-06-10T14:02:44+00:00

We don't need to attach ordered extents that have completed to the current
transaction. Doing so only makes us hold memory for longer than necessary
and delays the iput of the inode until the transaction is committed (for
each created ordered extent we do an igrab and then schedule an asynchronous
iput when the ordered extent's reference count drops to 0), preventing the
inode from being evictable until the transaction commits.

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason

Btrfs: avoid syncing log in the fast fsync path when not necessary

2015-06-10T14:02:43+00:00

Commit 3a8b36f37806 ("Btrfs: fix data loss in the fast fsync path")
introduced a performance regression that causes an unnecessary sync of the
log trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
against a file, with no writes or metadata updates to the inode in between
them, and a transaction is committed before the second fsync is called.

Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
after a sysbench test that measured a 62% decrease in file IO requests
per second for that test's workload.

The test is:

  echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
  mkfs -t btrfs /dev/sda2
  mount -t btrfs /dev/sda2 /fs/sda2
  cd /fs/sda2
  for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
  sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
    --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
    --file-num=1024 run

A test on a KVM guest running a debug kernel gave me the following results:

Without 3a8b36f378060d:             16.01 reqs/sec
With 3a8b36f378060d:                 3.39 reqs/sec
With 3a8b36f378060d and this patch: 16.04 reqs/sec

Reported-by: Huang Ying 
Tested-by: Huang, Ying 
Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason

Btrfs: remove csum_bytes_left

2015-06-03T11:03:06+00:00

After commit 8407f553268a
("Btrfs: fix data corruption after fast fsync and writeback error"),
during wait_ordered_extents(), we wait for each ordered extent to have
BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR set, at which point we
already have the checksum information, so we no longer need to check
(csum_bytes_left == 0) anywhere in the logging path.

Signed-off-by: Liu Bo 
Signed-off-by: Chris Mason

Btrfs: fix panic when starting bg cache writeout after IO error

2015-05-11T14:59:10+00:00

When waiting for the writeback of block group cache we returned
immediately if there was an error during writeback without waiting
for the ordered extent to complete. This left a short time window
where if some other task attempts to start the writeout for the same
block group cache it can attempt to add a new ordered extent, starting
at the same offset (0) before the previous one is removed from the
ordered tree, causing an ordered tree panic (calls BUG()).

This normally doesn't happen in other write paths, such as buffered
writes or direct IO writes for regular files, since before marking
page ranges dirty we lock the ranges and wait for any ordered extents
within the range to complete first.

Fix this by making btrfs_wait_ordered_range() not return immediately
if it gets an error from the writeback, waiting for all ordered extents
to complete first.
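
The control flow of the fix can be sketched as below. This is a user-space
illustration under assumptions, not btrfs_wait_ordered_range() itself:
struct extent and wait_all_extents() are hypothetical, and setting the
completed flag stands in for actually waiting on an ordered extent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ordered extent in the waited range. */
struct extent {
	int io_error;  /* 0 on success, negative errno on failure */
	int completed; /* set once we have waited for it */
};

/*
 * Wait for every extent, remembering only the first error, so that no
 * extent is still in flight when the caller sees the failure.
 */
static int wait_all_extents(struct extent *extents, size_t count,
			    int writeback_err)
{
	int ret = writeback_err;
	size_t i;

	assert(extents != NULL || count == 0);
	for (i = 0; i < count; i++) {
		extents[i].completed = 1; /* stands in for the real wait */
		if (!ret && extents[i].io_error)
			ret = extents[i].io_error;
	}
	return ret;
}
```

Returning early on writeback_err would skip the loop and leave extents
uncompleted, which is the window the commit message describes.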

This issue happened often when running the fstest btrfs/088 and it's
easy to trigger it by running in a loop until the panic happens:

  for ((i = 1; i <= 10000; i++)) do ./check btrfs/088 ; done

[17156.862573] BTRFS critical (device sdc): panic in ordered_data_tree_panic:70: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
[17156.864052] ------------[ cut here ]------------
[17156.864052] kernel BUG at fs/btrfs/ordered-data.c:70!
(...)
[17156.864052] Call Trace:
[17156.864052]  [] btrfs_add_ordered_extent+0x12/0x14 [btrfs]
[17156.864052]  [] run_delalloc_nocow+0x5bf/0x747 [btrfs]
[17156.864052]  [] run_delalloc_range+0x95/0x353 [btrfs]
[17156.864052]  [] writepage_delalloc.isra.16+0xb9/0x13f [btrfs]
[17156.864052]  [] __extent_writepage+0x129/0x1f7 [btrfs]
[17156.864052]  [] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs]
[17156.864052]  [] ? __module_text_address+0x12/0x59
[17156.864052]  [] ? trace_hardirqs_on+0xd/0xf
[17156.864052]  [] extent_writepages+0x4b/0x5c [btrfs]
[17156.864052]  [] ? kmem_cache_free+0x9b/0xce
[17156.864052]  [] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs]
[17156.864052]  [] ? free_extent_state+0x8c/0xc1 [btrfs]
[17156.864052]  [] btrfs_writepages+0x28/0x2a [btrfs]
[17156.864052]  [] do_writepages+0x23/0x2c
[17156.864052]  [] __filemap_fdatawrite_range+0x5a/0x61
[17156.864052]  [] filemap_fdatawrite_range+0x13/0x15
[17156.864052]  [] btrfs_fdatawrite_range+0x21/0x48 [btrfs]
[17156.864052]  [] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs]
[17156.864052]  [] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
[17156.864052]  [] btrfs_write_out_cache+0x93/0xdc [btrfs]
[17156.864052]  [] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs]
[17156.864052]  [] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs]
[17156.864052]  [] ? trace_hardirqs_on+0xd/0xf
[17156.864052]  [] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[17156.864052]  [] btrfs_sync_fs+0xe1/0x12d [btrfs]

Signed-off-by: Filipe Manana 
Signed-off-by: Chris Mason