<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/fs/btrfs/relocation.c, branch linux-4.7.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>Merge branch 'cleanups-4.7' into for-chris-4.7-20160525</title>
<updated>2016-05-25T20:51:03+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2016-05-25T20:51:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=42f31734eb7658fd01fb186d56312be869450a42'/>
<id>42f31734eb7658fd01fb186d56312be869450a42</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: fix string and comment grammatical issues and typos</title>
<updated>2016-05-25T20:35:14+00:00</updated>
<author>
<name>Nicholas D Steeves</name>
<email>nsteeves@gmail.com</email>
</author>
<published>2016-05-20T01:18:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=0132761017e012ab4dc8584d679503f2ba26ca86'/>
<id>0132761017e012ab4dc8584d679503f2ba26ca86</id>
<content type='text'>
Signed-off-by: Nicholas D Steeves &lt;nsteeves@gmail.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Nicholas D Steeves &lt;nsteeves@gmail.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-chris-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7</title>
<updated>2016-05-17T21:43:19+00:00</updated>
<author>
<name>Chris Mason</name>
<email>clm@fb.com</email>
</author>
<published>2016-05-17T21:43:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c315ef8d9db7f1a0ebd023a395ebdfde1c68057e'/>
<id>c315ef8d9db7f1a0ebd023a395ebdfde1c68057e</id>
<content type='text'>
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Chris Mason &lt;clm@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: fix race between block group relocation and nocow writes</title>
<updated>2016-05-13T00:59:34+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2016-05-09T12:15:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f78c436c3931e7df713688028f2b4faf72bf9f2a'/>
<id>f78c436c3931e7df713688028f2b4faf72bf9f2a</id>
<content type='text'>
Relocation of a block group waits for all existing tasks flushing
delalloc, starting direct IO writes and any ordered extents before
starting the relocation process. However for direct IO writes that end
up doing nocow (inode either has the flag nodatacow set or the write is
against a prealloc extent) we have a short time window that allows for a
race that makes relocation proceed without waiting for the direct IO
write to complete first, resulting in data loss after the relocation
finishes. This is illustrated by the following diagram:

           CPU 1                                     CPU 2

 btrfs_relocate_block_group(bg X)

                                               direct IO write starts against
                                               an extent in block group X
                                               using nocow mode (inode has the
                                               nodatacow flag or the write is
                                               for a prealloc extent)

                                               btrfs_direct_IO()
                                                 btrfs_get_blocks_direct()
                                                   --&gt; can_nocow_extent() returns 1

   btrfs_inc_block_group_ro(bg X)
     --&gt; turns block group into RO mode

   btrfs_wait_ordered_roots()
     --&gt; returns and does not know about
         the DIO write happening at CPU 2
         (the task there has not yet
          created an ordered extent)

   relocate_block_group(bg X)
     --&gt; rc-&gt;stage == MOVE_DATA_EXTENTS

     find_next_extent()
       --&gt; returns extent that the DIO
           write is going to write to

     relocate_data_extent()

       relocate_file_extent_cluster()

         --&gt; reads the extent from disk into
             pages belonging to the relocation
             inode and dirties them

                                                   --&gt; creates DIO ordered extent

                                                 btrfs_submit_direct()
                                                   --&gt; submits bio against a location
                                                       on disk obtained from an extent
                                                       map before the relocation started

   btrfs_wait_ordered_range()
     --&gt; writes all the pages read before
         to disk (belonging to the
         relocation inode)

   relocation finishes

                                                 bio completes and wrote new data
                                                 to the old location of the block
                                                 group

So fix this by tracking the number of nocow writers for a block group and
making sure relocation waits for that number to go down to 0 before starting
to move the extents.
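
The counting scheme described above can be sketched in user space as
follows. This is only an illustration under stated assumptions, not the
kernel patch: the BlockGroup class, its method names and the demo() helper
are hypothetical, and a Python condition variable stands in for whatever
wait primitive the kernel code uses. The sketch also assumes a writer that
fails to register falls back to a cow write.

```python
import threading
import time


class BlockGroup:
    """Toy stand-in for a block group; not the kernel structure."""

    def __init__(self):
        self._cond = threading.Condition()
        self.read_only = False
        self.nocow_writers = 0

    def inc_nocow_writers(self):
        """Register a nocow writer; refuse once the group is read-only."""
        with self._cond:
            if self.read_only:
                return False  # assumed: the caller falls back to cow
            self.nocow_writers += 1
            return True

    def dec_nocow_writers(self):
        """Drop a writer and wake any waiter once the count reaches 0."""
        with self._cond:
            self.nocow_writers -= 1
            if self.nocow_writers == 0:
                self._cond.notify_all()

    def set_read_only(self):
        """Block new nocow writers from registering."""
        with self._cond:
            self.read_only = True

    def wait_nocow_writers(self):
        """Block until every registered nocow writer has finished."""
        with self._cond:
            self._cond.wait_for(lambda: self.nocow_writers == 0)


def demo():
    bg = BlockGroup()
    events = []

    # The writer registers before the group is turned read-only,
    # which is the window the race depends on.
    assert bg.inc_nocow_writers()

    def writer():
        time.sleep(0.05)  # simulate the in-flight nocow write
        events.append("write done")
        bg.dec_nocow_writers()

    task = threading.Thread(target=writer)
    task.start()

    bg.set_read_only()
    bg.wait_nocow_writers()  # relocation must not start before this returns
    events.append("relocation starts")
    task.join()

    # A writer arriving after the switch to read-only is refused.
    late_writer_allowed = bg.inc_nocow_writers()
    return events, late_writer_allowed


if __name__ == "__main__":
    print(demo())
```

Calling demo() records the write completing before the relocation step, and
shows that a writer arriving after the group became read-only is turned away.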

The same race can also happen with buffered writes in nocow mode since the
patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
when relocating", because we are no longer flushing all delalloc which
served as a synchronization mechanism (due to page locking) and ensured
the ordered extents for nocow buffered writes were created before we
called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
mode existed before that patch (no pages are locked or used during direct
IO) and that patch fixed only races with direct IO writes that do cow.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Relocation of a block group waits for all existing tasks flushing
delalloc, starting direct IO writes and any ordered extents before
starting the relocation process. However for direct IO writes that end
up doing nocow (inode either has the flag nodatacow set or the write is
against a prealloc extent) we have a short time window that allows for a
race that makes relocation proceed without waiting for the direct IO
write to complete first, resulting in data loss after the relocation
finishes. This is illustrated by the following diagram:

           CPU 1                                     CPU 2

 btrfs_relocate_block_group(bg X)

                                               direct IO write starts against
                                               an extent in block group X
                                               using nocow mode (inode has the
                                               nodatacow flag or the write is
                                               for a prealloc extent)

                                               btrfs_direct_IO()
                                                 btrfs_get_blocks_direct()
                                                   --&gt; can_nocow_extent() returns 1

   btrfs_inc_block_group_ro(bg X)
     --&gt; turns block group into RO mode

   btrfs_wait_ordered_roots()
     --&gt; returns and does not know about
         the DIO write happening at CPU 2
         (the task there has not yet
          created an ordered extent)

   relocate_block_group(bg X)
     --&gt; rc-&gt;stage == MOVE_DATA_EXTENTS

     find_next_extent()
       --&gt; returns extent that the DIO
           write is going to write to

     relocate_data_extent()

       relocate_file_extent_cluster()

         --&gt; reads the extent from disk into
             pages belonging to the relocation
             inode and dirties them

                                                   --&gt; creates DIO ordered extent

                                                 btrfs_submit_direct()
                                                   --&gt; submits bio against a location
                                                       on disk obtained from an extent
                                                       map before the relocation started

   btrfs_wait_ordered_range()
     --&gt; writes all the pages read before
         to disk (belonging to the
         relocation inode)

   relocation finishes

                                                 bio completes and wrote new data
                                                 to the old location of the block
                                                 group

So fix this by tracking the number of nocow writers for a block group and
making sure relocation waits for that number to go down to 0 before starting
to move the extents.

The same race can also happen with buffered writes in nocow mode since the
patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
when relocating", because we are no longer flushing all delalloc which
served as a synchronization mechanism (due to page locking) and ensured
the ordered extents for nocow buffered writes were created before we
called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
mode existed before that patch (no pages are locked or used during direct
IO) and that patch fixed only races with direct IO writes that do cow.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: don't do unnecessary delalloc flushes when relocating</title>
<updated>2016-05-13T00:59:16+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2016-04-26T14:39:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=9cfa3e34e20e6798a671236000d9e97c8aa5d318'/>
<id>9cfa3e34e20e6798a671236000d9e97c8aa5d318</id>
<content type='text'>
Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() -&gt;
__start_delalloc_inodes() -&gt; btrfs_alloc_delalloc_work() -&gt;
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).

These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).

So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.

This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:

        CPU 1                                       CPU 2

 btrfs_relocate_block_group(bg X)

                                           starts direct IO write,
                                           target inode currently has no
                                           ordered extents ongoing nor
                                           dirty pages (delalloc regions),
                                           therefore the root for our inode
                                           is not in the list
                                           fs_info-&gt;ordered_roots

                                           btrfs_direct_IO()
                                             __blockdev_direct_IO()
                                               btrfs_get_blocks_direct()
                                                 btrfs_lock_extent_direct()
                                                   locks range in the io tree
                                                 btrfs_new_extent_direct()
                                                   btrfs_reserve_extent()
                                                     --&gt; extent allocated
                                                         from bg X

   btrfs_inc_block_group_ro(bg X)

   btrfs_start_delalloc_roots()
     __start_delalloc_inodes()
       --&gt; does nothing, no delalloc ranges
           in the inode's io tree so the
           inode's root is not in the list
           fs_info-&gt;delalloc_roots

   btrfs_wait_ordered_roots()
     --&gt; does not find the inode's root in the
         list fs_info-&gt;ordered_roots

     --&gt; ends up not waiting for the direct IO
         write started by the task at CPU 2

   relocate_block_group(rc-&gt;stage ==
     MOVE_DATA_EXTENTS)

     prepare_to_relocate()
       btrfs_commit_transaction()

     iterates the extent tree, using its
     commit root and moves extents into new
     locations

                                                   btrfs_add_ordered_extent_dio()
                                                     --&gt; now an ordered extent is
                                                         created and added to the
                                                         list root-&gt;ordered_extents
                                                         and the root added to the
                                                         list fs_info-&gt;ordered_roots
                                                     --&gt; this is too late and the
                                                         task at CPU 1 already
                                                         started the relocation

     btrfs_commit_transaction()

                                                   btrfs_finish_ordered_io()
                                                     btrfs_alloc_reserved_file_extent()
                                                       --&gt; adds delayed data reference
                                                           for the extent allocated
                                                           from bg X

   relocate_block_group(rc-&gt;stage ==
     UPDATE_DATA_PTRS)

     prepare_to_relocate()
       btrfs_commit_transaction()
         --&gt; delayed refs are run, so an extent
             item for the allocated extent from
             bg X is added to extent tree
         --&gt; commit roots are switched, so the
             next scan in the extent tree will
             see the extent item

     sees the extent in the extent tree

When this happens the relocation produces the following warning when it
finishes:

[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998]  [&lt;ffffffff812648b3&gt;] dump_stack+0x67/0x90
[ 7260.852998]  [&lt;ffffffff81051608&gt;] warn_slowpath_common+0x99/0xb2
[ 7260.852998]  [&lt;ffffffffa03c1b2d&gt;] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [&lt;ffffffff810516d4&gt;] warn_slowpath_null+0x1a/0x1c
[ 7260.852998]  [&lt;ffffffffa03c1b2d&gt;] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [&lt;ffffffffa039d9de&gt;] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998]  [&lt;ffffffffa039f314&gt;] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998]  [&lt;ffffffff8127d671&gt;] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998]  [&lt;ffffffffa03a9583&gt;] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998]  [&lt;ffffffffa03ac96a&gt;] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998]  [&lt;ffffffff811451df&gt;] ? handle_mm_fault+0x443/0xd63
[ 7260.852998]  [&lt;ffffffff81491817&gt;] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998]  [&lt;ffffffff8108b36a&gt;] ? arch_local_irq_save+0x9/0xc
[ 7260.852998]  [&lt;ffffffff811876ab&gt;] vfs_ioctl+0x18/0x34
[ 7260.852998]  [&lt;ffffffff81187cb2&gt;] do_vfs_ioctl+0x550/0x5be
[ 7260.852998]  [&lt;ffffffff81190c30&gt;] ? __fget_light+0x4d/0x71
[ 7260.852998]  [&lt;ffffffff81187d77&gt;] SyS_ioctl+0x57/0x79
[ 7260.852998]  [&lt;ffffffff81492017&gt;] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---

This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:

[ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
[ 7344.888431] RIP: 0010:[&lt;ffffffffa037c88c&gt;]  [&lt;ffffffffa037c88c&gt;] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
[ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
[ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
[ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
[ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
[ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
[ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431]  [&lt;ffffffffa0395c8f&gt;] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431]  [&lt;ffffffffa03970ac&gt;] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431]  [&lt;ffffffffa039680d&gt;] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431]  [&lt;ffffffffa037eeb4&gt;] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [&lt;ffffffff8108df55&gt;] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431]  [&lt;ffffffffa039728c&gt;] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431]  [&lt;ffffffffa037eeb4&gt;] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [&lt;ffffffffa039739b&gt;] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431]  [&lt;ffffffff81129d24&gt;] ? lru_cache_add+0xe/0x10
[ 7344.888431]  [&lt;ffffffffa0397ea8&gt;] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431]  [&lt;ffffffffa037eeb4&gt;] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [&lt;ffffffff8115daad&gt;] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431]  [&lt;ffffffffa037cdc9&gt;] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431]  [&lt;ffffffff81128316&gt;] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431]  [&lt;ffffffff811285a0&gt;] ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [&lt;ffffffff811285a0&gt;] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [&lt;ffffffff8111cf34&gt;] ? pagecache_get_page+0x2b/0x154
[ 7344.888431]  [&lt;ffffffff8112870e&gt;] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431]  [&lt;ffffffff8111dbf7&gt;] generic_file_read_iter+0x197/0x4e1
[ 7344.888431]  [&lt;ffffffff8117773a&gt;] __vfs_read+0x79/0x9d
[ 7344.888431]  [&lt;ffffffff81178050&gt;] vfs_read+0x8f/0xd2
[ 7344.888431]  [&lt;ffffffff81178a38&gt;] SyS_read+0x50/0x7e
[ 7344.888431]  [&lt;ffffffff81492017&gt;] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 &lt;0f&gt; 0b 4c 0
[ 7344.888431] RIP  [&lt;ffffffffa037c88c&gt;] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431]  RSP &lt;ffff8802046878f0&gt;
[ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Before we start the actual relocation process of a block group, we do
calls to flush delalloc of all inodes and then wait for ordered extents
to complete. However we do these flush calls just to make sure we don't
race with concurrent tasks that have actually already started to run
delalloc and have allocated an extent from the block group we want to
relocate, right before we set it to readonly mode, but have not yet
created the respective ordered extents. The flush calls make us wait
for such concurrent tasks because they end up calling
filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() -&gt;
__start_delalloc_inodes() -&gt; btrfs_alloc_delalloc_work() -&gt;
btrfs_run_delalloc_work()) which ends up serializing us with those tasks
due to attempts to lock the same pages (and the delalloc flush procedure
calls the allocator and creates the ordered extents before unlocking the
pages).

These flushing calls not only make us waste time (cpu, IO) but also reduce
the chances of writing larger extents (applications might be writing to
contiguous ranges and we flush before they finish dirtying the whole
ranges).

So make sure we don't flush delalloc and just wait for concurrent tasks
that have already started flushing delalloc and have allocated an extent
from the block group we are about to relocate.

This change also ends up fixing a race with direct IO writes that makes
relocation not wait for direct IO ordered extents. This race is
illustrated by the following diagram:

        CPU 1                                       CPU 2

 btrfs_relocate_block_group(bg X)

                                           starts direct IO write,
                                           target inode currently has no
                                           ordered extents ongoing nor
                                           dirty pages (delalloc regions),
                                           therefore the root for our inode
                                           is not in the list
                                           fs_info-&gt;ordered_roots

                                           btrfs_direct_IO()
                                             __blockdev_direct_IO()
                                               btrfs_get_blocks_direct()
                                                 btrfs_lock_extent_direct()
                                                   locks range in the io tree
                                                 btrfs_new_extent_direct()
                                                   btrfs_reserve_extent()
                                                     --&gt; extent allocated
                                                         from bg X

   btrfs_inc_block_group_ro(bg X)

   btrfs_start_delalloc_roots()
     __start_delalloc_inodes()
       --&gt; does nothing, no delalloc ranges
           in the inode's io tree so the
           inode's root is not in the list
           fs_info-&gt;delalloc_roots

   btrfs_wait_ordered_roots()
     --&gt; does not find the inode's root in the
         list fs_info-&gt;ordered_roots

     --&gt; ends up not waiting for the direct IO
         write started by the task at CPU 2

   relocate_block_group(rc-&gt;stage ==
     MOVE_DATA_EXTENTS)

     prepare_to_relocate()
       btrfs_commit_transaction()

     iterates the extent tree, using its
     commit root and moves extents into new
     locations

                                                   btrfs_add_ordered_extent_dio()
                                                     --&gt; now an ordered extent is
                                                         created and added to the
                                                         list root-&gt;ordered_extents
                                                         and the root added to the
                                                         list fs_info-&gt;ordered_roots
                                                     --&gt; this is too late and the
                                                         task at CPU 1 already
                                                         started the relocation

     btrfs_commit_transaction()

                                                   btrfs_finish_ordered_io()
                                                     btrfs_alloc_reserved_file_extent()
                                                       --&gt; adds delayed data reference
                                                           for the extent allocated
                                                           from bg X

   relocate_block_group(rc-&gt;stage ==
     UPDATE_DATA_PTRS)

     prepare_to_relocate()
       btrfs_commit_transaction()
         --&gt; delayed refs are run, so an extent
             item for the allocated extent from
             bg X is added to extent tree
         --&gt; commit roots are switched, so the
             next scan in the extent tree will
             see the extent item

     sees the extent in the extent tree

When this happens the relocation produces the following warning when it
finishes:

[ 7260.832836] ------------[ cut here ]------------
[ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
[ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
[ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
[ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
[ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
[ 7260.852998] Call Trace:
[ 7260.852998]  [&lt;ffffffff812648b3&gt;] dump_stack+0x67/0x90
[ 7260.852998]  [&lt;ffffffff81051608&gt;] warn_slowpath_common+0x99/0xb2
[ 7260.852998]  [&lt;ffffffffa03c1b2d&gt;] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [&lt;ffffffff810516d4&gt;] warn_slowpath_null+0x1a/0x1c
[ 7260.852998]  [&lt;ffffffffa03c1b2d&gt;] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
[ 7260.852998]  [&lt;ffffffffa039d9de&gt;] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
[ 7260.852998]  [&lt;ffffffffa039f314&gt;] btrfs_balance+0xde1/0xe4e [btrfs]
[ 7260.852998]  [&lt;ffffffff8127d671&gt;] ? debug_smp_processor_id+0x17/0x19
[ 7260.852998]  [&lt;ffffffffa03a9583&gt;] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
[ 7260.852998]  [&lt;ffffffffa03ac96a&gt;] btrfs_ioctl+0x11e0/0x1dff [btrfs]
[ 7260.852998]  [&lt;ffffffff811451df&gt;] ? handle_mm_fault+0x443/0xd63
[ 7260.852998]  [&lt;ffffffff81491817&gt;] ? _raw_spin_unlock+0x31/0x44
[ 7260.852998]  [&lt;ffffffff8108b36a&gt;] ? arch_local_irq_save+0x9/0xc
[ 7260.852998]  [&lt;ffffffff811876ab&gt;] vfs_ioctl+0x18/0x34
[ 7260.852998]  [&lt;ffffffff81187cb2&gt;] do_vfs_ioctl+0x550/0x5be
[ 7260.852998]  [&lt;ffffffff81190c30&gt;] ? __fget_light+0x4d/0x71
[ 7260.852998]  [&lt;ffffffff81187d77&gt;] SyS_ioctl+0x57/0x79
[ 7260.852998]  [&lt;ffffffff81492017&gt;] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---

This is because at the end of the first stage, in relocate_block_group(),
we commit the current transaction, which makes delayed refs run, the
commit roots are switched and so the second stage will find the extent
item that the ordered extent added to the delayed refs. But this extent
was not moved (ordered extent completed after first stage finished), so
at the end of the relocation our block group item still has a positive
used bytes counter, triggering a warning at the end of
btrfs_relocate_block_group(). Later on when trying to read the extent
contents from disk we hit a BUG_ON() due to the inability to map a block
with a logical address that belongs to the block group we relocated and
is no longer valid, resulting in the following trace:

[ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
[ 7344.887518] ------------[ cut here ]------------
[ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
[ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
[ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
[ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
[ 7344.888431] RIP: 0010:[&lt;ffffffffa037c88c&gt;]  [&lt;ffffffffa037c88c&gt;] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
[ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
[ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
[ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
[ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
[ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
[ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
[ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
[ 7344.888431] Stack:
[ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
[ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
[ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
[ 7344.888431] Call Trace:
[ 7344.888431]  [&lt;ffffffffa0395c8f&gt;] submit_extent_page+0xf5/0x16f [btrfs]
[ 7344.888431]  [&lt;ffffffffa03970ac&gt;] __do_readpage+0x4a0/0x4f1 [btrfs]
[ 7344.888431]  [&lt;ffffffffa039680d&gt;] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
[ 7344.888431]  [&lt;ffffffffa037eeb4&gt;] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [&lt;ffffffff8108df55&gt;] ? trace_hardirqs_on+0xd/0xf
[ 7344.888431]  [&lt;ffffffffa039728c&gt;] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
[ 7344.888431]  [&lt;ffffffffa037eeb4&gt;] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [&lt;ffffffffa039739b&gt;] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
[ 7344.888431]  [&lt;ffffffff81129d24&gt;] ? lru_cache_add+0xe/0x10
[ 7344.888431]  [&lt;ffffffffa0397ea8&gt;] extent_readpages+0x160/0x1aa [btrfs]
[ 7344.888431]  [&lt;ffffffffa037eeb4&gt;] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
[ 7344.888431]  [&lt;ffffffff8115daad&gt;] ? alloc_pages_current+0xa9/0xcd
[ 7344.888431]  [&lt;ffffffffa037cdc9&gt;] btrfs_readpages+0x1f/0x21 [btrfs]
[ 7344.888431]  [&lt;ffffffff81128316&gt;] __do_page_cache_readahead+0x168/0x1fc
[ 7344.888431]  [&lt;ffffffff811285a0&gt;] ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [&lt;ffffffff811285a0&gt;] ? ondemand_readahead+0x1f6/0x207
[ 7344.888431]  [&lt;ffffffff8111cf34&gt;] ? pagecache_get_page+0x2b/0x154
[ 7344.888431]  [&lt;ffffffff8112870e&gt;] page_cache_sync_readahead+0x3d/0x3f
[ 7344.888431]  [&lt;ffffffff8111dbf7&gt;] generic_file_read_iter+0x197/0x4e1
[ 7344.888431]  [&lt;ffffffff8117773a&gt;] __vfs_read+0x79/0x9d
[ 7344.888431]  [&lt;ffffffff81178050&gt;] vfs_read+0x8f/0xd2
[ 7344.888431]  [&lt;ffffffff81178a38&gt;] SyS_read+0x50/0x7e
[ 7344.888431]  [&lt;ffffffff81492017&gt;] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 &lt;0f&gt; 0b 4c 0
[ 7344.888431] RIP  [&lt;ffffffffa037c88c&gt;] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
[ 7344.888431]  RSP &lt;ffff8802046878f0&gt;
[ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Btrfs: don't wait for unrelated IO to finish before relocation</title>
<updated>2016-05-13T00:59:14+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2016-04-26T14:36:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=578def7c50f236432ba140d35bb7ca4ef0a1b20b'/>
<id>578def7c50f236432ba140d35bb7ca4ef0a1b20b</id>
<content type='text'>
Before the relocation process of a block group starts, it sets the block
group to readonly mode, then flushes all delalloc writes and then finally
it waits for all ordered extents to complete. This last step includes
waiting for ordered extents destined for extents allocated in other block
groups, which wastes time unnecessarily.

So improve this by waiting only for ordered extents that fall into the
block group's range.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Before the relocation process of a block group starts, it sets the block
group to readonly mode, then flushes all delalloc writes and then finally
it waits for all ordered extents to complete. This last step includes
waiting for ordered extents destined for extents allocated in other block
groups, which wastes time unnecessarily.

So improve this by waiting only for ordered extents that fall into the
block group's range.

Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: Josef Bacik &lt;jbacik@fb.com&gt;
Reviewed-by: Liu Bo &lt;bo.li.liu@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: sink gfp parameter to clear_extent_bits</title>
<updated>2016-04-29T09:01:47+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2016-04-26T21:54:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=91166212e0ffbb4db50aa8a238548e967008b33d'/>
<id>91166212e0ffbb4db50aa8a238548e967008b33d</id>
<content type='text'>
Callers pass GFP_NOFS and GFP_KERNEL. No need to pass the flags around.

Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Callers pass GFP_NOFS and GFP_KERNEL. No need to pass the flags around.

Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: sink gfp parameter to set_extent_bits</title>
<updated>2016-04-29T09:01:47+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2016-04-26T21:54:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ceeb0ae7bf42a3deaaaee981d2da02e5d3ad2b0f'/>
<id>ceeb0ae7bf42a3deaaaee981d2da02e5d3ad2b0f</id>
<content type='text'>
All callers pass GFP_NOFS.

Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
All callers pass GFP_NOFS.

Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: rename btrfs_std_error to btrfs_handle_fs_error</title>
<updated>2016-04-28T08:36:54+00:00</updated>
<author>
<name>Anand Jain</name>
<email>anand.jain@oracle.com</email>
</author>
<published>2016-03-16T08:43:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=34d9700702f4042ce10d68a092ab7f79575e7a3b'/>
<id>34d9700702f4042ce10d68a092ab7f79575e7a3b</id>
<content type='text'>
btrfs_std_error() handles errors and (as of now) puts the FS into readonly
mode, so it's a good idea to rename it to btrfs_handle_fs_error().

Signed-off-by: Anand Jain &lt;anand.jain@oracle.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
[ edit changelog ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
btrfs_std_error() handles errors and (as of now) puts the FS into readonly
mode, so it's a good idea to rename it to btrfs_handle_fs_error().

Signed-off-by: Anand Jain &lt;anand.jain@oracle.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
[ edit changelog ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs</title>
<updated>2016-04-09T17:41:34+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2016-04-09T17:41:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=839a3f765728cdca0057a12e2dc0bf669ac1c22e'/>
<id>839a3f765728cdca0057a12e2dc0bf669ac1c22e</id>
<content type='text'>
Pull btrfs fixes from Chris Mason:
 "These are bug fixes, including a really old fsync bug, and a few trace
  points to help us track down problems in the quota code"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix file/data loss caused by fsync after rename and new inode
  btrfs: Reset IO error counters before start of device replacing
  btrfs: Add qgroup tracing
  Btrfs: don't use src fd for printk
  btrfs: fallback to vmalloc in btrfs_compare_tree
  btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
  btrfs: Output more info for enospc_debug mount option
  Btrfs: fix invalid reference in replace_path
  Btrfs: Improve FL_KEEP_SIZE handling in fallocate
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull btrfs fixes from Chris Mason:
 "These are bug fixes, including a really old fsync bug, and a few trace
  points to help us track down problems in the quota code"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix file/data loss caused by fsync after rename and new inode
  btrfs: Reset IO error counters before start of device replacing
  btrfs: Add qgroup tracing
  Btrfs: don't use src fd for printk
  btrfs: fallback to vmalloc in btrfs_compare_tree
  btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
  btrfs: Output more info for enospc_debug mount option
  Btrfs: fix invalid reference in replace_path
  Btrfs: Improve FL_KEEP_SIZE handling in fallocate
</pre>
</div>
</content>
</entry>
</feed>
