<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/btrfs/relocation.c, branch v6.7</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux</title>
<updated>2023-10-30T20:42:06+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-10-30T20:42:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d5acbc60fafbe0fc94c552ce916dd592cd4c6371'/>
<id>d5acbc60fafbe0fc94c552ce916dd592cd4c6371</id>
<content type='text'>
Pull btrfs updates from David Sterba:
 "New features:

   - raid-stripe-tree

     New tree for logical file extent mapping where the physical mapping
     may not match on multiple devices. This is now used in zoned mode
     to implement RAID0/RAID1* profiles, but can be used in non-zoned
     mode as well. The support for RAID56 is in development and will
     eventually fix the problems with the current implementation. This
     is a backward incompatible feature and has to be enabled at mkfs
     time.

   - simple quota accounting (squota)

     A simplified mode of qgroup that accounts all space on the initial
     extent owners (a subvolume), the snapshots are then cheap to create
     and delete. The deletion of snapshots in fully accounting qgroups
     is a known CPU/IO performance bottleneck.

     The squota is not suitable for the general use case but works well
     for containers where the original subvolume exists for the whole
     time. This is a backward incompatible feature as it needs extending
     some structures, but can be enabled on an existing filesystem.

   - temporary filesystem fsid (temp_fsid)

     The fsid identifies a filesystem and is hard coded in the
     structures, which disallows mounting the same fsid found on
     different devices.

     For a single device filesystem this is not strictly necessary, a
     new temporary fsid can be generated on mount e.g. after a device is
     cloned. This will be used by Steam Deck for root partition A/B
     testing, or can be used for VM root images.

  Other user visible changes:

   - filesystems with partially finished metadata_uuid conversion cannot
     be mounted anymore and the uuid fixup has to be done by btrfs-progs
     (btrfstune).

  Performance improvements:

   - reduce reservations for checksum deletions (with enabled free space
     tree by factor of 4), on a sample workload on file with many
     extents the deletion time decreased by 12%

   - make extent state merges more efficient during insertions, reduce
     rb-tree iterations (run time of critical functions reduced by 5%)

  Core changes:

   - the integrity check functionality has been removed, this was a
     debugging feature and removal does not affect other integrity
     checks like checksums or tree-checker

   - space reservation changes:

      - more efficient delayed ref reservations, this avoids building up
        too much work or overusing or exhausting the global block
        reserve in some situations

      - move delayed refs reservation to the transaction start time,
        this prevents some ENOSPC corner cases related to exhaustion of
        global reserve

      - improvements in reducing excessive reservations for block group
        items

      - adjust overcommit logic in near full situations, account for one
        more chunk to eventually allocate metadata chunk, this is mostly
        relevant for small filesystems (&lt;10GiB)

   - single device filesystems are scanned but not registered (except
     seed devices), this allows temp_fsid to work

   - qgroup iterations do not need GFP_ATOMIC allocations anymore

   - cleanups, refactoring, reduced data structure size, function
     parameter simplifications, error handling fixes"

* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
  btrfs: open code timespec64 in struct btrfs_inode
  btrfs: remove redundant log root tree index assignment during log sync
  btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
  btrfs: sysfs: show temp_fsid feature
  btrfs: disable the device add feature for temp-fsid
  btrfs: disable the seed feature for temp-fsid
  btrfs: update comment for temp-fsid, fsid, and metadata_uuid
  btrfs: remove pointless empty log context list check when syncing log
  btrfs: update comment for struct btrfs_inode::lock
  btrfs: remove pointless barrier from btrfs_sync_file()
  btrfs: add and use helpers for reading and writing last_trans_committed
  btrfs: add and use helpers for reading and writing fs_info-&gt;generation
  btrfs: add and use helpers for reading and writing log_transid
  btrfs: add and use helpers for reading and writing last_log_commit
  btrfs: support cloned-device mount capability
  btrfs: add helper function find_fsid_by_disk
  btrfs: stop reserving excessive space for block group item insertions
  btrfs: stop reserving excessive space for block group item updates
  btrfs: reorder btrfs_inode to fill gaps
  btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull btrfs updates from David Sterba:
 "New features:

   - raid-stripe-tree

     New tree for logical file extent mapping where the physical mapping
     may not match on multiple devices. This is now used in zoned mode
     to implement RAID0/RAID1* profiles, but can be used in non-zoned
     mode as well. The support for RAID56 is in development and will
     eventually fix the problems with the current implementation. This
     is a backward incompatible feature and has to be enabled at mkfs
     time.

   - simple quota accounting (squota)

     A simplified mode of qgroup that accounts all space on the initial
     extent owners (a subvolume), the snapshots are then cheap to create
     and delete. The deletion of snapshots in fully accounting qgroups
     is a known CPU/IO performance bottleneck.

     The squota is not suitable for the general use case but works well
     for containers where the original subvolume exists for the whole
     time. This is a backward incompatible feature as it needs extending
     some structures, but can be enabled on an existing filesystem.

   - temporary filesystem fsid (temp_fsid)

     The fsid identifies a filesystem and is hard coded in the
     structures, which disallows mounting the same fsid found on
     different devices.

     For a single device filesystem this is not strictly necessary, a
     new temporary fsid can be generated on mount e.g. after a device is
     cloned. This will be used by Steam Deck for root partition A/B
     testing, or can be used for VM root images.

  Other user visible changes:

   - filesystems with partially finished metadata_uuid conversion cannot
     be mounted anymore and the uuid fixup has to be done by btrfs-progs
     (btrfstune).

  Performance improvements:

   - reduce reservations for checksum deletions (with enabled free space
     tree by factor of 4), on a sample workload on file with many
     extents the deletion time decreased by 12%

   - make extent state merges more efficient during insertions, reduce
     rb-tree iterations (run time of critical functions reduced by 5%)

  Core changes:

   - the integrity check functionality has been removed, this was a
     debugging feature and removal does not affect other integrity
     checks like checksums or tree-checker

   - space reservation changes:

      - more efficient delayed ref reservations, this avoids building up
        too much work or overusing or exhausting the global block
        reserve in some situations

      - move delayed refs reservation to the transaction start time,
        this prevents some ENOSPC corner cases related to exhaustion of
        global reserve

      - improvements in reducing excessive reservations for block group
        items

      - adjust overcommit logic in near full situations, account for one
        more chunk to eventually allocate metadata chunk, this is mostly
        relevant for small filesystems (&lt;10GiB)

   - single device filesystems are scanned but not registered (except
     seed devices), this allows temp_fsid to work

   - qgroup iterations do not need GFP_ATOMIC allocations anymore

   - cleanups, refactoring, reduced data structure size, function
     parameter simplifications, error handling fixes"

* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
  btrfs: open code timespec64 in struct btrfs_inode
  btrfs: remove redundant log root tree index assignment during log sync
  btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
  btrfs: sysfs: show temp_fsid feature
  btrfs: disable the device add feature for temp-fsid
  btrfs: disable the seed feature for temp-fsid
  btrfs: update comment for temp-fsid, fsid, and metadata_uuid
  btrfs: remove pointless empty log context list check when syncing log
  btrfs: update comment for struct btrfs_inode::lock
  btrfs: remove pointless barrier from btrfs_sync_file()
  btrfs: add and use helpers for reading and writing last_trans_committed
  btrfs: add and use helpers for reading and writing fs_info-&gt;generation
  btrfs: add and use helpers for reading and writing log_transid
  btrfs: add and use helpers for reading and writing last_log_commit
  btrfs: support cloned-device mount capability
  btrfs: add helper function find_fsid_by_disk
  btrfs: stop reserving excessive space for block group item insertions
  btrfs: stop reserving excessive space for block group item updates
  btrfs: reorder btrfs_inode to fill gaps
  btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: fix unwritten extent buffer after snapshotting a new subvolume</title>
<updated>2023-10-23T15:17:30+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2023-10-19T12:19:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=eb96e221937af3c7bb8a63208dbab813ca5d3d7e'/>
<id>eb96e221937af3c7bb8a63208dbab813ca5d3d7e</id>
<content type='text'>
When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:

1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
   address 36007936, for the leaf/root of a new subvolume that has an ID
   of 291. We mark the extent buffer as dirty, and at this point the
   subvolume tree has a single node/leaf which is also its root (level 0);

2) We no longer commit the transaction used to create the subvolume at
   create_subvol(). We used to, but that was recently removed in
   commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol
   create");

3) The transaction used to create the subvolume has an ID of 33, so the
   extent buffer 36007936 has a generation of 33;

4) Several updates happen to subvolume 291 during transaction 33, several
   files created and its tree height changes from 0 to 1, so we end up with
   a new root at level 1 and the extent buffer 36007936 is now a leaf of
   that new root node, which is extent buffer 36048896.

   The commit root remains as 36007936, since we are still at transaction
   33;

5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
   ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
   we end up at transaction.c:create_pending_snapshot(), in the critical
   section of a transaction commit.

   There we COW the root of subvolume 291, which is extent buffer 36048896.
   The COW operation returns extent buffer 36048896, since there's no need
   to COW because the extent buffer was created in this transaction and it
   was not written yet.

   The we call btrfs_copy_root() against the root node 36048896. During
   this operation we allocate a new extent buffer to turn into the root
   node of the snapshot, copy the contents of the root node 36048896 into
   this snapshot root extent buffer, set the owner to 292 (the ID of the
   snapshot), etc, and then we call btrfs_inc_ref(). This will create a
   delayed reference for each leaf pointed by the root node with a
   reference root of 292 - this includes a reference for the leaf
   36007936.

   After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.

   Then we call btrfs_insert_dir_item(), to create the directory entry in
   in the tree of subvolume 291 that points to the snapshot. This ends up
   needing to modify leaf 36007936 to insert the respective directory
   items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
   we need to COW the leaf. We end up at btrfs_force_cow_block() and then
   at update_ref_for_cow().

   At update_ref_for_cow() we call btrfs_block_can_be_shared() which
   returns false, despite the fact the leaf 36007936 is shared - the
   subvolume's root and the snapshot's root point to that leaf. The
   reason that it incorrectly returns false is because the commit root
   of the subvolume is extent buffer 36007936 - it was the initial root
   of the subvolume when we created it. So btrfs_block_can_be_shared()
   which has the following logic:

   int btrfs_block_can_be_shared(struct btrfs_root *root,
                                 struct extent_buffer *buf)
   {
       if (test_bit(BTRFS_ROOT_SHAREABLE, &amp;root-&gt;state) &amp;&amp;
           buf != root-&gt;node &amp;&amp; buf != root-&gt;commit_root &amp;&amp;
           (btrfs_header_generation(buf) &lt;=
            btrfs_root_last_snapshot(&amp;root-&gt;root_item) ||
            btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
               return 1;

       return 0;
   }

   Returns false (0) since 'buf' (extent buffer 36007936) matches the
   root's commit root.

   As a result, at update_ref_for_cow(), we don't check for the number
   of references for extent buffer 36007936, we just assume it's not
   shared and therefore that it has only 1 reference, so we set the local
   variable 'refs' to 1.

   Later on, in the final if-else statement at update_ref_for_cow():

   static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
                                          struct btrfs_root *root,
                                          struct extent_buffer *buf,
                                          struct extent_buffer *cow,
                                          int *last_ref)
   {
      (...)
      if (refs &gt; 1) {
          (...)
      } else {
          (...)
          btrfs_clear_buffer_dirty(trans, buf);
          *last_ref = 1;
      }
   }

   So we mark the extent buffer 36007936 as not dirty, and as a result
   we don't write it to disk later in the transaction commit, despite the
   fact that the snapshot's root points to it.

   Attempting to access the leaf or dumping the tree for example shows
   that the extent buffer was not written:

   $ btrfs inspect-internal dump-tree -t 292 /dev/sdb
   btrfs-progs v6.2.2
   file tree key (292 ROOT_ITEM 33)
   node 36110336 level 1 items 2 free space 119 generation 33 owner 292
   node 36110336 flags 0x1(WRITTEN) backref revision 1
   checksum stored a8103e3e
   checksum calced a8103e3e
   fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
           key (256 INODE_ITEM 0) block 36007936 gen 33
           key (257 EXTENT_DATA 0) block 36052992 gen 33
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   total bytes 107374182400
   bytes used 38572032
   uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79

   The respective on disk region is full of zeroes as the device was
   trimmed at mkfs time.

   Obviously 'btrfs check' also detects and complains about this:

   $ btrfs check /dev/sdb
   Opening filesystem to check...
   Checking filesystem on /dev/sdb
   UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   generation: 33 (33)
   [1/7] checking root items
   [2/7] checking extents
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   owner ref check failed [36007936 4096]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   The following tree block(s) is corrupted in tree 292:
        tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
   root 292 root dir 256 not found
   ERROR: errors found in fs roots
   found 38572032 bytes used, error(s) found
   total csum bytes: 16048
   total tree bytes: 1265664
   total fs tree bytes: 1118208
   total extent tree bytes: 65536
   btree space waste bytes: 562598
   file data blocks allocated: 65978368
    referenced 36569088

Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.

This can be reproduced with the following script:

   $ cat test.sh
   #!/bin/bash

   MNT=/mnt/sdi
   DEV=/dev/sdi

   # Use a filesystem with a 64K node size so that we have the same node
   # size on every machine regardless of its page size (on x86_64 default
   # node size is 16K due to the 4K page size, while on PPC it's 64K by
   # default). This way we can make sure we are able to create a btree for
   # the subvolume with a height of 2.
   mkfs.btrfs -f -n 64K $DEV
   mount $DEV $MNT

   btrfs subvolume create $MNT/subvol

   # Create a few empty files on the subvolume, this bumps its btree
   # height to 2 (root node at level 1 and 2 leaves).
   for ((i = 1; i &lt;= 300; i++)); do
       echo -n &gt; $MNT/subvol/file_$i
   done

   btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap

   umount $DEV

   btrfs check $DEV

Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):

   $ ./test.sh
   Create subvolume '/mnt/sdi/subvol'
   Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
   Opening filesystem to check...
   Checking filesystem on /dev/sdi
   UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
   [1/7] checking root items
   [2/7] checking extents
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   owner ref check failed [30539776 65536]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
   Wrong generation of child node/leaf, wanted: 5, have: 7
   root 257 root dir 256 not found
   ERROR: errors found in fs roots
   found 917504 bytes used, error(s) found
   total csum bytes: 0
   total tree bytes: 851968
   total fs tree bytes: 393216
   total extent tree bytes: 65536
   btree space waste bytes: 736550
   file data blocks allocated: 0
    referenced 0

A test case for fstests will follow soon.

Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:

1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
   address 36007936, for the leaf/root of a new subvolume that has an ID
   of 291. We mark the extent buffer as dirty, and at this point the
   subvolume tree has a single node/leaf which is also its root (level 0);

2) We no longer commit the transaction used to create the subvolume at
   create_subvol(). We used to, but that was recently removed in
   commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol
   create");

3) The transaction used to create the subvolume has an ID of 33, so the
   extent buffer 36007936 has a generation of 33;

4) Several updates happen to subvolume 291 during transaction 33, several
   files created and its tree height changes from 0 to 1, so we end up with
   a new root at level 1 and the extent buffer 36007936 is now a leaf of
   that new root node, which is extent buffer 36048896.

   The commit root remains as 36007936, since we are still at transaction
   33;

5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
   ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
   we end up at transaction.c:create_pending_snapshot(), in the critical
   section of a transaction commit.

   There we COW the root of subvolume 291, which is extent buffer 36048896.
   The COW operation returns extent buffer 36048896, since there's no need
   to COW because the extent buffer was created in this transaction and it
   was not written yet.

   The we call btrfs_copy_root() against the root node 36048896. During
   this operation we allocate a new extent buffer to turn into the root
   node of the snapshot, copy the contents of the root node 36048896 into
   this snapshot root extent buffer, set the owner to 292 (the ID of the
   snapshot), etc, and then we call btrfs_inc_ref(). This will create a
   delayed reference for each leaf pointed by the root node with a
   reference root of 292 - this includes a reference for the leaf
   36007936.

   After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.

   Then we call btrfs_insert_dir_item(), to create the directory entry in
   in the tree of subvolume 291 that points to the snapshot. This ends up
   needing to modify leaf 36007936 to insert the respective directory
   items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
   we need to COW the leaf. We end up at btrfs_force_cow_block() and then
   at update_ref_for_cow().

   At update_ref_for_cow() we call btrfs_block_can_be_shared() which
   returns false, despite the fact the leaf 36007936 is shared - the
   subvolume's root and the snapshot's root point to that leaf. The
   reason that it incorrectly returns false is because the commit root
   of the subvolume is extent buffer 36007936 - it was the initial root
   of the subvolume when we created it. So btrfs_block_can_be_shared()
   which has the following logic:

   int btrfs_block_can_be_shared(struct btrfs_root *root,
                                 struct extent_buffer *buf)
   {
       if (test_bit(BTRFS_ROOT_SHAREABLE, &amp;root-&gt;state) &amp;&amp;
           buf != root-&gt;node &amp;&amp; buf != root-&gt;commit_root &amp;&amp;
           (btrfs_header_generation(buf) &lt;=
            btrfs_root_last_snapshot(&amp;root-&gt;root_item) ||
            btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
               return 1;

       return 0;
   }

   Returns false (0) since 'buf' (extent buffer 36007936) matches the
   root's commit root.

   As a result, at update_ref_for_cow(), we don't check for the number
   of references for extent buffer 36007936, we just assume it's not
   shared and therefore that it has only 1 reference, so we set the local
   variable 'refs' to 1.

   Later on, in the final if-else statement at update_ref_for_cow():

   static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
                                          struct btrfs_root *root,
                                          struct extent_buffer *buf,
                                          struct extent_buffer *cow,
                                          int *last_ref)
   {
      (...)
      if (refs &gt; 1) {
          (...)
      } else {
          (...)
          btrfs_clear_buffer_dirty(trans, buf);
          *last_ref = 1;
      }
   }

   So we mark the extent buffer 36007936 as not dirty, and as a result
   we don't write it to disk later in the transaction commit, despite the
   fact that the snapshot's root points to it.

   Attempting to access the leaf or dumping the tree for example shows
   that the extent buffer was not written:

   $ btrfs inspect-internal dump-tree -t 292 /dev/sdb
   btrfs-progs v6.2.2
   file tree key (292 ROOT_ITEM 33)
   node 36110336 level 1 items 2 free space 119 generation 33 owner 292
   node 36110336 flags 0x1(WRITTEN) backref revision 1
   checksum stored a8103e3e
   checksum calced a8103e3e
   fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
           key (256 INODE_ITEM 0) block 36007936 gen 33
           key (257 EXTENT_DATA 0) block 36052992 gen 33
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   total bytes 107374182400
   bytes used 38572032
   uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79

   The respective on disk region is full of zeroes as the device was
   trimmed at mkfs time.

   Obviously 'btrfs check' also detects and complains about this:

   $ btrfs check /dev/sdb
   Opening filesystem to check...
   Checking filesystem on /dev/sdb
   UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
   generation: 33 (33)
   [1/7] checking root items
   [2/7] checking extents
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   owner ref check failed [36007936 4096]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
   bad tree block 36007936, bytenr mismatch, want=36007936, have=0
   The following tree block(s) is corrupted in tree 292:
        tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
   root 292 root dir 256 not found
   ERROR: errors found in fs roots
   found 38572032 bytes used, error(s) found
   total csum bytes: 16048
   total tree bytes: 1265664
   total fs tree bytes: 1118208
   total extent tree bytes: 65536
   btree space waste bytes: 562598
   file data blocks allocated: 65978368
    referenced 36569088

Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.

This can be reproduced with the following script:

   $ cat test.sh
   #!/bin/bash

   MNT=/mnt/sdi
   DEV=/dev/sdi

   # Use a filesystem with a 64K node size so that we have the same node
   # size on every machine regardless of its page size (on x86_64 default
   # node size is 16K due to the 4K page size, while on PPC it's 64K by
   # default). This way we can make sure we are able to create a btree for
   # the subvolume with a height of 2.
   mkfs.btrfs -f -n 64K $DEV
   mount $DEV $MNT

   btrfs subvolume create $MNT/subvol

   # Create a few empty files on the subvolume, this bumps its btree
   # height to 2 (root node at level 1 and 2 leaves).
   for ((i = 1; i &lt;= 300; i++)); do
       echo -n &gt; $MNT/subvol/file_$i
   done

   btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap

   umount $DEV

   btrfs check $DEV

Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):

   $ ./test.sh
   Create subvolume '/mnt/sdi/subvol'
   Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
   Opening filesystem to check...
   Checking filesystem on /dev/sdi
   UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
   [1/7] checking root items
   [2/7] checking extents
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   owner ref check failed [30539776 65536]
   ERROR: errors found in extent allocation tree or chunk allocation
   [3/7] checking free space tree
   [4/7] checking fs roots
   parent transid verify failed on 30539776 wanted 7 found 5
   Ignoring transid failure
   Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
   Wrong generation of child node/leaf, wanted: 5, have: 7
   root 257 root dir 256 not found
   ERROR: errors found in fs roots
   found 917504 bytes used, error(s) found
   total csum bytes: 0
   total tree bytes: 851968
   total fs tree bytes: 393216
   total extent tree bytes: 65536
   btree space waste bytes: 736550
   file data blocks allocated: 0
    referenced 0

A test case for fstests will follow soon.

Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: change test_range_bit to scan the whole range</title>
<updated>2023-10-12T14:44:14+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2020-08-14T09:35:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=893fe2439994666d94dbe19f5fd59cec17c8f0a8'/>
<id>893fe2439994666d94dbe19f5fd59cec17c8f0a8</id>
<content type='text'>
The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.

Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The semantics of test_range_bit() with filled == 0 is now in it's own
helper so test_range_bit will check the whole range unconditionally.
The detection logic is flipped and assumes success by default and
catches exceptions.

Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: relocation: constify parameters where possible</title>
<updated>2023-10-12T14:44:13+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2023-09-22T11:07:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ab7c8bbf3a088730e58da224bcad512f1dd9ca74'/>
<id>ab7c8bbf3a088730e58da224bcad512f1dd9ca74</id>
<content type='text'>
Lots of the functions in relocation.c don't change pointer parameters
but lack the annotations. Add them and reformat according to current
coding style if needed.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Lots of the functions in relocation.c don't change pointer parameters
but lack the annotations. Add them and reformat according to current
coding style if needed.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: relocation: return bool from btrfs_should_ignore_reloc_root</title>
<updated>2023-10-12T14:44:13+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2023-09-22T11:07:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=32f2abca380fedc60f7a8d3288e4c9586672e207'/>
<id>32f2abca380fedc60f7a8d3288e4c9586672e207</id>
<content type='text'>
btrfs_should_ignore_reloc_root() is a predicate so it should return
bool.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
btrfs_should_ignore_reloc_root() is a predicate so it should return
bool.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: switch btrfs_backref_cache::is_reloc to bool</title>
<updated>2023-10-12T14:44:13+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2023-09-22T11:07:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c71d3c698cb53f9deff82ac71ba576c571fe8c8f'/>
<id>c71d3c698cb53f9deff82ac71ba576c571fe8c8f</id>
<content type='text'>
The btrfs_backref_cache::is_reloc is an indicator variable and should
use a bool type.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The btrfs_backref_cache::is_reloc is an indicator variable and should
use a bool type.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: relocation: open code mapping_tree_init</title>
<updated>2023-10-12T14:44:13+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2023-09-22T11:07:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=733fa44de3bc936103f3a15fbdad6c545da8f248'/>
<id>733fa44de3bc936103f3a15fbdad6c545da8f248</id>
<content type='text'>
There's only one user of mapping_tree_init, we don't need a helper for
the simple initialization.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There's only one user of mapping_tree_init, we don't need a helper for
the simple initialization.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: relocation: switch bitfields to bool in reloc_control</title>
<updated>2023-10-12T14:44:13+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2023-09-22T11:07:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d23d42e39b03f3fb5fd06e8c8a38447beae8e217'/>
<id>d23d42e39b03f3fb5fd06e8c8a38447beae8e217</id>
<content type='text'>
Use bool types for the indicators instead of bitfields. The structure
size slightly grows but the new types are placed within the padding.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use bool types for the indicators instead of bitfields. The structure
size slightly grows but the new types are placed within the padding.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: relocation: use enum for stages</title>
<updated>2023-10-12T14:44:13+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2023-09-22T11:07:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8daf07cf2b7919e7c2cf6249b19ff9bb88f95c9d'/>
<id>8daf07cf2b7919e7c2cf6249b19ff9bb88f95c9d</id>
<content type='text'>
Add an enum type for data relocation stages.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add an enum type for data relocation stages.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>btrfs: relocation: use more natural types for tree_block bitfields</title>
<updated>2023-10-12T14:44:13+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2023-09-22T11:07:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a3bb700f43a1c5800bd5fc35cf6182f8ecadf58f'/>
<id>a3bb700f43a1c5800bd5fc35cf6182f8ecadf58f</id>
<content type='text'>
We don't need to use bitfields for tree_block::level and
tree_block::key_ready, there's enough padding in the structure for
proper types.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We don't need to use bitfields for tree_block::level and
tree_block::key_ready, there's enough padding in the structure for
proper types.

Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Reviewed-by: Qu Wenruo &lt;wqu@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
