linux.git/fs/btrfs/ordered-data.c, branch v5.14

btrfs: store a block_device in struct btrfs_ordered_extent

2021-07-22T13:50:15+00:00

Store the block device instead of the gendisk in the btrfs_ordered_extent
structure instead of acquiring a reference to it later.

Note: this is from series removing bdgrab/bdput, btrfs is one of the
last users.

Signed-off-by: Christoph Hellwig 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: make page Ordered bit to be subpage compatible

2021-06-21T13:19:10+00:00

This involves the following modification:

- Ordered extent creation
  This is done in process_one_page(), now PAGE_SET_ORDERED will call
  subpage helper to do the work.

- endio functions
  This is done in btrfs_mark_ordered_io_finished().

- btrfs_invalidatepage()

- btrfs_cleanup_ordered_extents()
  Use the subpage page helper, and add an extra branch to exit if the
  locked page have covered the full range.

Now the usage of page Ordered flag for ordered extent accounting is fully
subpage compatible.

Tested-by: Ritesh Harjani  # [ppc64]
Tested-by: Anand Jain  # [aarch64]
Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba

btrfs: rename PagePrivate2 to PageOrdered inside btrfs

2021-06-21T13:19:09+00:00

Inside btrfs we use Private2 page status to indicate we have an ordered
extent with pending IO for the sector.

But the page status name, Private2, tells us nothing about the bit
itself, so this patch will rename it to Ordered.
And with extra comment about the bit added, so reader who is still
uncertain about the page Ordered status, will find the comment pretty
easily.

Reviewed-by: Josef Bacik 
Signed-off-by: Qu Wenruo 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: introduce btrfs_lookup_first_ordered_range()

2021-06-21T13:19:08+00:00

Although we already have btrfs_lookup_first_ordered_extent() and
btrfs_lookup_ordered_extent(), they all have their own limitations:

- btrfs_lookup_ordered_extent() can't do extra range check

  It's only designed to lookup any ordered extent before certain bytenr.

- btrfs_lookup_first_ordered_extent() may not return the first ordered
  extent in the range

  It doesn't ensure the first ordered extent is returned.
  The existing callers are only interested in exhausting all ordered
  extents in a range, the order is not important.

For incoming btrfs_invalidatepage() refactoring, we need a way to
properly iterate all ordered extents in their bytenr order of a range.

So this patch will introduce a new function,
btrfs_lookup_first_ordered_range(), to do ordered extent with bytenr
order awareness and extra range check.

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba

btrfs: refactor how we finish ordered extent io for endio functions

2021-06-21T13:19:08+00:00

Btrfs has two endio functions to mark certain io range finished for
ordered extents:

- __endio_write_update_ordered()
  This is for direct IO

- btrfs_writepage_endio_finish_ordered()
  This for buffered IO.

However they go different routines to handle ordered extent io:

- Whether to iterate through all ordered extents
  __endio_write_update_ordered() will but
  btrfs_writepage_endio_finish_ordered() will not.

  In fact, iterating through all ordered extents will benefit later
  subpage support, while for current PAGE_SIZE == sectorsize requirement
  this behavior makes no difference.

- Whether to update page Private2 flag
  __endio_write_update_ordered() will not update page Private2 flag as
  for iomap direct IO, the page can not be even mapped.
  While btrfs_writepage_endio_finish_ordered() will clear Private2 to
  prevent double accounting against btrfs_invalidatepage().

Those differences are pretty subtle, and the ordered extent iterations
code in callers makes code much harder to read.

So this patch will introduce a new function,
btrfs_mark_ordered_io_finished(), to do the heavy lifting:

- Iterate through all ordered extents in the range
- Do the ordered extent accounting
- Queue the work for finished ordered extent

This function has two new feature:

- Proper underflow detection and recovery
  The old underflow detection will only detect the problem, then
  continue.
  No proper info like root/inode/ordered extent info, nor noisy enough
  to be caught by fstests.

  Furthermore when underflow happens, the ordered extent will never
  finish.

  New error detection will reset the bytes_left to 0, do proper
  kernel warning, and output extra info including root, ino, ordered
  extent range, the underflow value.

- Prevent double accounting based on Private2 flag
  Now if we find a range without Private2 flag, we will skip to next
  range.
  As that means someone else has already finished the accounting of
  ordered extent.

  This makes no difference for current code, but will be a critical part
  for incoming subpage support, as we can call
  btrfs_mark_ordered_io_finished() for multiple sectors if they are
  beyond inode size.
  Thus such double accounting prevention is a key feature for subpage.

Now both endio functions only need to call that new function.

And since the only caller of btrfs_dec_test_first_ordered_pending() is
removed, also remove btrfs_dec_test_first_ordered_pending() completely.

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba

btrfs: zoned: fix silent data loss after failure splitting ordered extent

2021-04-28T18:09:38+00:00

On a zoned filesystem, sometimes we need to split an ordered extent into 3
different ordered extents. The original ordered extent is shortened, at
the front and at the rear, and we create two other new ordered extents to
represent the trimmed parts of the original ordered extent.

After adjusting the original ordered extent, we create an ordered extent
to represent the pre-range, and that may fail with ENOMEM for example.
After that we always try to create the ordered extent for the post-range,
and if that happens to succeed we end up returning success to the caller
as we overwrite the 'ret' variable which contained the previous error.

This means we end up with a file range for which there is no ordered
extent, which results in the range never getting a new file extent item
pointing to the new data location. And since the split operation did
not return an error, writeback does not fail and the inode's mapping is
not flagged with an error, resulting in a subsequent fsync not reporting
an error either.

It's possibly very unlikely to have the creation of the post-range ordered
extent succeed after the creation of the pre-range ordered extent failed,
but it's not impossible.

So fix this by making sure we only create the post-range ordered extent
if there was no error creating the ordered extent for the pre-range.

Fixes: d22002fd37bd97 ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Filipe Manana 
Signed-off-by: David Sterba

btrfs: replace offset_in_entry with in_range

2021-04-19T15:25:14+00:00

No point in duplicating the functionality just use the generic helper
that has the same semantics.

Signed-off-by: Nikolay Borisov 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: zoned: use ZONE_APPEND write for zoned mode

2021-02-09T01:46:06+00:00

Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.

Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.

Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().

And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.

Reviewed-by: Josef Bacik 
Signed-off-by: Johannes Thumshirn 
Signed-off-by: Naohiro Aota 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: save irq flags when looking up an ordered extent

2021-02-09T01:46:06+00:00

A following patch will add another caller of
btrfs_lookup_ordered_extent(), but from a bio's endio context.

btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally
disables interrupts. Change this to spin_lock_irqsave() so interrupts
aren't disabled and re-enabled unconditionally.

Reviewed-by: Josef Bacik 
Signed-off-by: Johannes Thumshirn 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: zoned: split ordered extent when bio is sent

2021-02-09T01:46:05+00:00

For a zone append write, the device decides the location the data is being
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.

Implement splitting of an ordered extent and extent map on bio submission
to adhere to the rule.

extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
corresponding ordered extent so that the ordered extent's region fits into
one bio and the corresponding device limits.

Several sanity checks need to be done in extract_ordered_extent() e.g.

- We cannot split once end_bio'd ordered extent because we cannot divide
  ordered->bytes_left for the split ones
- We do not expect a compressed ordered extent
- We should not have checksum list because we omit the list splitting.
  Since the function is called before btrfs_wq_submit_bio() or
  btrfs_csum_one_bio(), this should be always ensured.

We also need to split an extent map by creating a new one. If not,
unpin_extent_cache() complains about the difference between the start of
the extent map and the file's logical offset.

Reviewed-by: Josef Bacik 
Signed-off-by: Naohiro Aota 
Signed-off-by: David Sterba