linux-stable.git/fs, branch v6.5.4

btrfs: scrub: fix grouping of read IO

2023-09-19T10:30:24+00:00

commit ae76d8e3e1351aa1ba09cc68dab6866d356f2e17 upstream.

[REGRESSION]
There are several regression reports about the scrub performance with
v6.4 kernel.

On a PCIe 3.0 device, the old v6.3 kernel can go 3GB/s scrub speed, but
v6.4 can only go 1GB/s, an obvious 66% performance drop.

[CAUSE]
Iostat shows a very different behavior between v6.3 and v6.4 kernel:

  Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
  nvme0n1p3  9731.00 3425544.00 17237.00  63.92    2.18   352.02  21.18 100.00
  nvme0n1p3 15578.00  993616.00     5.00   0.03    0.09    63.78   1.32 100.00

The upper one is v6.3 while the lower one is v6.4.

There are several obvious differences:

- Very few read merges
  This turns out to be a behavior change that we no longer do bio
  plug/unplug.

- Very low aqu-sz
  This is due to the submit-and-wait behavior of flush_scrub_stripes(),
  and extra extent/csum tree search.

Both behaviors are not that obvious on SATA SSDs, as SATA SSDs have NCQ
to merge the reads, while SATA SSDs can not handle high queue depth well
either.

[FIX]
For now this patch focuses on the read speed fix. Dev-replace replace
speed needs more work.

For the read part, we go two directions to fix the problems:

- Re-introduce blk plug/unplug to merge read requests
  This is pretty simple, and the behavior is pretty easy to observe.

  This would enlarge the average read request size to 512K.

- Introduce multi-group reads and no longer wait for each group
  Instead of the old behavior, which submits 8 stripes and waits for
  them, here we would enlarge the total number of stripes to 16 * 8.
  Which is 8M per device, the same limit as the old scrub in-flight
  bios size limit.

  Now every time we fill a group (8 stripes), we submit them and
  continue to next stripes.

  Only when the full 16 * 8 stripes are all filled, we submit the
  remaining ones (the last group), and wait for all groups to finish.
  Then submit the repair writes and dev-replace writes.

  This should enlarge the queue depth.

This would greatly improve the merge rate (thus read block size) and
queue depth:

Before (with regression, and cached extent/csum path):

 Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
 nvme0n1p3 20666.00 1318240.00    10.00   0.05    0.08    63.79   1.63 100.00

After (with all patches applied):

 nvme0n1p3  5165.00 2278304.00 30557.00  85.54    0.55   441.10   2.81 100.00

i.e. 1287 to 2224 MB/s.

CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: scrub: avoid unnecessary csum tree search preparing stripes

2023-09-19T10:30:24+00:00

commit 3c771c194402ffe20d4de68d9fc21e703179a9ce upstream.

One of the bottleneck of the new scrub code is the extra csum tree
search.

The old code would only do the csum tree search for each scrub bio,
which can be as large as 512KiB, thus they can afford to allocate a new
path each time.

But the new scrub code is doing csum tree search for each stripe, which
is only 64KiB, this means we'd better re-use the same csum path during
each search.

This patch would introduce a per-sctx path for csum tree search, as we
don't need to re-allocate the path every time we need to do a csum tree
search.

With this change we can further improve the queue depth and improve the
scrub read performance:

Before (with regression and cached extent tree path):

 Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
 nvme0n1p3 15875.00 1013328.00    12.00   0.08    0.08    63.83   1.35 100.00

After (with both cached extent/csum tree path):

 nvme0n1p3 17759.00 1133280.00    10.00   0.06    0.08    63.81   1.50 100.00

Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: scrub: avoid unnecessary extent tree search preparing stripes

2023-09-19T10:30:24+00:00

commit 1dc4888e725dc748b82858984f2a5bd41efc5201 upstream.

Since commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer re-use the same path
for extent tree search.

This can lead to unnecessary extent tree search, especially for the new
stripe based scrub, as we have way more stripes to prepare.

This patch would re-introduce a shared path for extent tree search, and
properly release it when the block group is scrubbed.

This change alone can improve scrub performance slightly by reducing the
time spend preparing the stripe thus improving the queue depth.

Before (with regression):

 Device         r/s      rkB/s   rrqm/s  %rrqm r_await rareq-sz aqu-sz  %util
 nvme0n1p3 15578.00  993616.00     5.00   0.03    0.09    63.78   1.32 100.00

After (with this patch):

 nvme0n1p3 15875.00 1013328.00    12.00   0.08    0.08    63.83   1.35 100.00

Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
CC: stable@vger.kernel.org # 6.4+
Signed-off-by: Qu Wenruo 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: use the correct superblock to compare fsid in btrfs_validate_super

2023-09-19T10:30:24+00:00

commit d167aa76dc0683828588c25767da07fb549e4f48 upstream.

The function btrfs_validate_super() should verify the fsid in the provided
superblock argument. Because, all its callers expect it to do that.

Such as in the following stack:

   write_all_supers()
       sb = fs_info->super_for_commit;
       btrfs_validate_write_super(.., sb)
         btrfs_validate_super(.., sb, ..)

   scrub_one_super()
	btrfs_validate_super(.., sb, ..)

And
   check_dev_super()
	btrfs_validate_super(.., sb, ..)

However, it currently verifies the fs_info::super_copy::fsid instead,
which is not correct.  Fix this using the correct fsid in the superblock
argument.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn 
Tested-by: Guilherme G. Piccoli 
Signed-off-by: Anand Jain 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: zoned: re-enable metadata over-commit for zoned mode

2023-09-19T10:30:24+00:00

commit 5b135b382a360f4c87cf8896d1465b0b07f10cb0 upstream.

Now that, we can re-enable metadata over-commit. As we moved the activation
from the reservation time to the write time, we no longer need to ensure
all the reserved bytes is properly activated.

Without the metadata over-commit, it suffers from lower performance because
it needs to flush the delalloc items more often and allocate more block
groups. Re-enabling metadata over-commit will solve the issue.

Fixes: 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Naohiro Aota 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: set page extent mapped after read_folio in relocate_one_page

2023-09-19T10:30:24+00:00

commit e7f1326cc24e22b38afc3acd328480a1183f9e79 upstream.

One of the CI runs triggered the following panic

  assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/subpage.c:229!
  Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  CPU: 0 PID: 923660 Comm: btrfs Not tainted 6.5.0-rc3+ #1
  pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : btrfs_subpage_assert+0xbc/0xf0
  lr : btrfs_subpage_assert+0xbc/0xf0
  sp : ffff800093213720
  x29: ffff800093213720 x28: ffff8000932138b4 x27: 000000000c280000
  x26: 00000001b5d00000 x25: 000000000c281000 x24: 000000000c281fff
  x23: 0000000000001000 x22: 0000000000000000 x21: ffffff42b95bf880
  x20: ffff42b9528e0000 x19: 0000000000001000 x18: ffffffffffffffff
  x17: 667274622f736620 x16: 6e69202c65746176 x15: 0000000000000028
  x14: 0000000000000003 x13: 00000000002672d7 x12: 0000000000000000
  x11: ffffcd3f0ccd9204 x10: ffffcd3f0554ae50 x9 : ffffcd3f0379528c
  x8 : ffff800093213428 x7 : 0000000000000000 x6 : ffffcd3f091771e8
  x5 : ffff42b97f333948 x4 : 0000000000000000 x3 : 0000000000000000
  x2 : 0000000000000000 x1 : ffff42b9556cde80 x0 : 000000000000004f
  Call trace:
   btrfs_subpage_assert+0xbc/0xf0
   btrfs_subpage_set_dirty+0x38/0xa0
   btrfs_page_set_dirty+0x58/0x88
   relocate_one_page+0x204/0x5f0
   relocate_file_extent_cluster+0x11c/0x180
   relocate_data_extent+0xd0/0xf8
   relocate_block_group+0x3d0/0x4e8
   btrfs_relocate_block_group+0x2d8/0x490
   btrfs_relocate_chunk+0x54/0x1a8
   btrfs_balance+0x7f4/0x1150
   btrfs_ioctl+0x10f0/0x20b8
   __arm64_sys_ioctl+0x120/0x11d8
   invoke_syscall.constprop.0+0x80/0xd8
   do_el0_svc+0x6c/0x158
   el0_svc+0x50/0x1b0
   el0t_64_sync_handler+0x120/0x130
   el0t_64_sync+0x194/0x198
  Code: 91098021 b0007fa0 91346000 97e9c6d2 (d4210000)

This is the same problem outlined in 17b17fcd6d44 ("btrfs:
set_page_extent_mapped after read_folio in btrfs_cont_expand") , and the
fix is the same.  I originally looked for the same pattern elsewhere in
our code, but mistakenly skipped over this code because I saw the page
cache readahead before we set_page_extent_mapped, not realizing that
this was only in the !page case, that we can still end up with a
!uptodate page and then do the btrfs_read_folio further down.

The fix here is the same as the above mentioned patch, move the
set_page_extent_mapped call to after the btrfs_read_folio() block to
make sure that we have the subpage blocksize stuff setup properly before
using the page.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana 
Signed-off-by: Josef Bacik 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: don't start transaction when joining with TRANS_JOIN_NOSTART

2023-09-19T10:30:24+00:00

commit 4490e803e1fe9fab8db5025e44e23b55df54078b upstream.

When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a
running transaction we end up creating one. This goes against the purpose
of TRANS_JOIN_NOSTART which is to join a running transaction if its state
is at or below the state TRANS_STATE_COMMIT_START, otherwise return an
-ENOENT error and don't start a new transaction. So fix this to not create
a new transaction if there's no running transaction at or below that
state.

CC: stable@vger.kernel.org # 4.14+
Fixes: a6d155d2e363 ("Btrfs: fix deadlock between fiemap and transaction commits")
Signed-off-by: Filipe Manana 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: free qgroup rsv on io failure

2023-09-19T10:30:24+00:00

commit e28b02118b94e42be3355458a2406c6861e2dd32 upstream.

If we do a write whose bio suffers an error, we will never reclaim the
qgroup reserved space for it. We allocate the space in the write_iter
codepath, then release the reservation as we allocate the ordered
extent, but we only create a delayed ref if the ordered extent finishes.
If it has an error, we simply leak the rsv. This is apparent in running
any error injecting (dmerror) fstests like btrfs/146 or btrfs/160. Such
tests fail due to dmesg on umount complaining about the leaked qgroup
data space.

When we clean up other aspects of space on failed ordered_extents, also
free the qgroup rsv.

Reviewed-by: Josef Bacik 
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: fix start transaction qgroup rsv double free

2023-09-19T10:30:23+00:00

commit a6496849671a5bc9218ecec25a983253b34351b1 upstream.

btrfs_start_transaction reserves metadata space of the PERTRANS type
before it identifies a transaction to start/join. This allows flushing
when reserving that space without a deadlock. However, it results in a
race which temporarily breaks qgroup rsv accounting.

T1                                              T2
start_transaction
do_stuff
                                            start_transaction
                                                qgroup_reserve_meta_pertrans
commit_transaction
    qgroup_free_meta_all_pertrans
                                            hit an error starting txn
                                            goto reserve_fail
                                            qgroup_free_meta_pertrans (already freed!)

The basic issue is that there is nothing preventing another commit from
committing before start_transaction finishes (in fact sometimes we
intentionally wait for it) so any error path that frees the reserve is
at risk of this race.

While this exact space was getting freed anyway, and it's not a huge
deal to double free it (just a warning, the free code catches this), it
can result in incorrectly freeing some other pertrans reservation in
this same reservation, which could then lead to spuriously granting
reservations we might not have the space for. Therefore, I do believe it
is worth fixing.

To fix it, use the existing prealloc->pertrans conversion mechanism.
When we first reserve the space, we reserve prealloc space and only when
we are sure we have a transaction do we convert it to pertrans. This way
any racing commits do not blow away our reservation, but we still get a
pertrans reservation that is freed when _this_ transaction gets committed.

This issue can be reproduced by running generic/269 with either qgroups
or squotas enabled via mkfs on the scratch device.

Reviewed-by: Josef Bacik 
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Boris Burkov 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman

btrfs: zoned: do not zone finish data relocation block group

2023-09-19T10:30:23+00:00

commit 332581bde2a419d5f12a93a1cdc2856af649a3cc upstream.

When multiple writes happen at once, we may need to sacrifice a currently
active block group to be zone finished for a new allocation. We choose a
block group with the least free space left, and zone finish it.

To do the finishing, we need to send IOs for already allocated region
and wait for them and on-going IOs. Otherwise, these IOs fail because the
zone is already finished at the time the IO reach a device.

However, if a block group dedicated to the data relocation is zone
finished, there is a chance that finishing it before an ongoing write IO
reaches the device. That is because there is timing gap between an
allocation is done (block_group->reservations == 0, as pre-allocation is
done) and an ordered extent is created when the relocation IO starts.
Thus, if we finish the zone between them, we can fail the IOs.

We cannot simply use "fs_info->data_reloc_bg == block_group->start" to
avoid the zone finishing. Because, the data_reloc_bg may already switch to
a new block group, while there are still ongoing write IOs to the old
data_reloc_bg.

So, this patch reworks the BLOCK_GROUP_FLAG_ZONED_DATA_RELOC bit to
indicate there is a data relocation allocation and/or ongoing write to the
block group. The bit is set on allocation and cleared in end_io function of
the last IO for the currently allocated region.

To change the timing of the bit setting also solves the issue that the bit
being left even after there is no IO going on. With the current code, if
the data_reloc_bg switches after the last IO to the current data_reloc_bg,
the bit is set at this timing and there is no one clearing that bit. As a
result, that block group is kept unallocatable for anything.

Fixes: 343d8a30851c ("btrfs: zoned: prevent allocation from previous data relocation BG")
Fixes: 74e91b12b115 ("btrfs: zoned: zone finish unused block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Naohiro Aota 
Signed-off-by: David Sterba 
Signed-off-by: Greg Kroah-Hartman