linux.git/fs/btrfs/ordered-data.c, branch v6.16

btrfs: fold error checks when allocating ordered extent and update comments

2025-05-15T12:30:56+00:00

Instead of having an error check and return on each branch of the if
statement, move the error check to happen after that if branch, reducing
source code and object code sizes.

Before this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1840174	 163742	  16136	2020052	 1ed2d4	fs/btrfs/btrfs.ko

After this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1840138	 163742	  16136	2020016	 1ed2b0	fs/btrfs/btrfs.ko

While at it and moving the comments, update the comments to be more clear
about how qgroup reserved space is released and the intricacies of how
it's managed for COW writes.

Reviewed-by: Boris Burkov 
Signed-off-by: Filipe Manana 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: check we grabbed inode reference when allocating an ordered extent

2025-05-15T12:30:56+00:00

When allocating an ordered extent we call igrab() to get a reference on
the inode and attach it to the ordered extent. For an ordered extent we
always must have an inode reference since we during its life cycle we
need to access the inode for several things like for example:

* Inserting the ordered extent right after allocating it, when calling
  insert_ordered_extent() - we need to lock the inode's ordered_tree_lock;

* In the bio submission path we need to add checksums to the ordered
  extent and we end up at btrfs_add_ordered_sum(), where again we need
  to grab the inode from the ordered extent to lock the inode's
  ordered_tree_lock;

* When finishing an ordered extent, at btrfs_finish_ordered_extent(), we
  need again to access its inode in order to lock the inode's
  ordered_tree_lock;

* Etc etc etc.

Everywhere we deal with an ordered extent we always expect its inode to
be not NULL, the only exception being btrfs_put_ordered_extent() where
we check if it's NULL before calling btrfs_add_delayed_iput(), even though
we have already assumed it's not NULL when calling the tracepoint
trace_btrfs_ordered_extent_put() since the tracepoint dereferences the
inode to extract its number and root without ever checking it's NULL.

The igrab() call can return NULL if the inode is about to be freed or is
being freed (its state has I_FREEING or I_WILL_FREE set), and that's why
there's such check at btrfs_put_ordered_extent(). The igrab() and NULL
check were introduced in commit 5fd02043553b ("Btrfs: finish ordered
extents in their own thread") but even back then we always needed and
assumed igrab() returned a non-NULL pointer, since for example when
removing an ordered extent, at btrfs_remove_ordered_extent(), we assumed
the inode pointer was not NULL in order to access the inode's ordered
extent tree.

In fact whenever we allocate an ordered extent we are holding an inode
reference and the inode is not being freed or going to be freed (which
happens in the final iput), and since we depend on the inode for the
life cycle of the ordered extent, just make ordered extent allocation
to fail in case igrab() returns NULL and trigger a warning, to make it
clear it's not expected. This allows to remove the confusing NULL inode
check at btrfs_put_ordered_extent().

Reviewed-by: Boris Burkov 
Signed-off-by: Filipe Manana 
Signed-off-by: David Sterba

btrfs: fix qgroup reservation leak on failure to allocate ordered extent

2025-05-15T12:30:56+00:00

If we fail to allocate an ordered extent for a COW write we end up leaking
a qgroup data reservation since we called btrfs_qgroup_release_data() but
we didn't call btrfs_qgroup_free_refroot() (which would happen when
running the respective data delayed ref created by ordered extent
completion or when finishing the ordered extent in case an error happened).

So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
ordered extent for a COW write.

Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov 
Reviewed-by: Qu Wenruo 
Signed-off-by: Filipe Manana 
Signed-off-by: David Sterba

btrfs: simplify csum list release at btrfs_put_ordered_extent()

2025-05-15T12:30:55+00:00

Instead of extracting each element by grabbing the list's first member in
a local list_head variable, then extracting the csum with list_entry() and
iterating with a while loop checking for list emptyness, use the iteration
helper list_for_each_entry_safe(). This also removes the need to delete
elements from the list with list_del() since the ordered extent is freed
immediately after.

Reviewed-by: Qu Wenruo 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Filipe Manana 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: use unsigned types for constants defined as bit shifts

2025-05-15T12:30:48+00:00

The unsigned type is a recommended practice (CWE-190, CWE-194) for bit
shifts to avoid problems with potential unwanted sign extensions.
Although there are no such cases in btrfs codebase, follow the
recommendation.

Reviewed-by: Boris Burkov 
Signed-off-by: David Sterba

btrfs: add btrfs prefix to main lock, try lock and unlock extent functions

2025-05-15T12:30:43+00:00

These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.

Reviewed-by: Johannes Thumshirn 
Signed-off-by: Filipe Manana 
Reviewed-by: David Sterba 
Signed-off-by: David Sterba

btrfs: introduce a read path dedicated extent lock helper

2025-03-18T19:35:48+00:00

Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage optimizations:

- It will call btrfs_start_ordered_extent() to writeback the involved
  folios

  But remember we're calling btrfs_lock_and_flush_ordered_range() at
  read paths, meaning the folio is already locked by read path.

  If we really trigger writeback for those already locked folios, this
  will lead to a deadlock and writeback cannot get the folio lock.

  Such dead lock is prevented by the fact that btrfs always keeps a
  dirty folio also uptodate, by either dirtying all blocks of the folio,
  or by reading the whole folio before dirtying.

To prepare for the incoming patch which allows btrfs to skip full folio
read if the buffered write is block aligned, we have to start by solving
the possible deadlock first.

Instead of blindly calling btrfs_start_ordered_extent(), introduce a
new helper, which is smarter in the following ways:

- Only wait and flush the ordered extent if
  * The folio doesn't even have private bit set
  * Part of the blocks of the ordered extent are not uptodate

  This can happen by:
  * The folio writeback finished, then got invalidated.
    There are a lot of reasons that a folio can get invalidated,
    from memory pressure to direct IO (which invalidates all folios
    of the range).
    But OE not yet finished.

  We have to wait for the ordered extent, as the OE may contain
  to-be-inserted data checksum.
  Without waiting, our read can fail due to the missing checksum.

  But either way, the OE should not need any extra flush inside the
  locked folio range.

- Skip the ordered extent completely if
  * All the blocks are dirty
    This happens when OE creation is caused by a folio writeback whose
    file offset is before our folio.

    E.g. 16K page size and 4K block size

    0      8K      16K      24K     32K
    |//////////////||///////|       |

    The writeback of folio 0 created an OE for range [0, 24K), but since
    folio 16K is not fully uptodate, a read is triggered for folio 16K.

    The writeback will never happen (we're holding the folio lock for
    read), nor will the OE finish.

    Thus we must skip the range.

  * All the blocks are uptodate
    This happens when the writeback finished, but OE not yet finished.

    Since the blocks are already uptodate, we can skip the OE range.

The new helper lock_extents_for_read() will do a loop for the target
range by:

1) Lock the full range

2) If there is no ordered extent in the remaining range, exit

3) If there is an ordered extent that we can skip
   Skip to the end of the OE, and continue checking
   We do not trigger writeback nor wait for the OE.

4) If there is an ordered extent that we cannot skip
   Unlock the whole extent range and start the ordered extent.

And also update btrfs_start_ordered_extent() to add two more parameters:
@nowriteback_start and @nowriteback_len, to prevent triggering flush for
a certain range.

This will allow us to handle the following case properly in the future:

  16K page size, 4K btrfs block size:

  0     4K      8K     12K      16K      20K      24K     28K      32K
  |/////////////////////////////||////////////////|       |        |
  |<-------------------- OE 2 ------------------->|       |< OE 1 >|

  The folio has been written back before, thus we have an OE at
  [28K, 32K).
  Although the OE 1 finished its IO, the OE is not yet removed from IO
  tree.
  The folio got invalidated after writeback completed and before the
  ordered extent finished.

  And [16K, 24K) range is dirty and uptodate, caused by a block aligned
  buffered write (and future enhancements allowing btrfs to skip full
  folio read for such case).
  But writeback for folio 0 has began, thus it generated OE 2, covering
  range [0, 24K).

  Since the full folio 16K is not uptodate, if we want to read the folio,
  the existing btrfs_lock_and_flush_ordered_range() will dead lock, by:

  btrfs_read_folio()
  | Folio 16K is already locked
  |- btrfs_lock_and_flush_ordered_range()
     |- btrfs_start_ordered_extent() for range [16K, 24K)
        |- filemap_fdatawrite_range() for range [16K, 24K)
           |- extent_write_cache_pages()
              folio_lock() on folio 16K, deadlock.

  But now we will have the following sequence:

  btrfs_read_folio()
  | Folio 16K is already locked
  |- lock_extents_for_read()
     |- can_skip_ordered_extent() for range [16K, 24K)
     |  Returned true, the range [16K, 24K) will be skipped.
     |- can_skip_ordered_extent() for range [28K, 32K)
     |  Returned false.
     |- btrfs_start_ordered_extent() for range [28K, 32K) with
        [16K, 32K) as no writeback range
        No writeback for folio 16K will be triggered.

  And there will be no more possible deadlock on the same folio.

Reviewed-by: Filipe Manana 
Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba

btrfs: fix assertion failure when splitting ordered extent after transaction abort

2025-01-23T21:34:09+00:00

If while we are doing a direct IO write a transaction abort happens, we
mark all existing ordered extents with the BTRFS_ORDERED_IOERR flag (done
at btrfs_destroy_ordered_extents()), and then after that if we enter
btrfs_split_ordered_extent() and the ordered extent has bytes left
(meaning we have a bio that doesn't cover the whole ordered extent, see
details at btrfs_extract_ordered_extent()), we will fail on the following
assertion at btrfs_split_ordered_extent():

   ASSERT(!(flags & ~BTRFS_ORDERED_TYPE_FLAGS));

because the BTRFS_ORDERED_IOERR flag is set and the definition of
BTRFS_ORDERED_TYPE_FLAGS is just the union of all flags that identify the
type of write (regular, nocow, prealloc, compressed, direct IO, encoded).

Fix this by returning an error from btrfs_extract_ordered_extent() if we
find the BTRFS_ORDERED_IOERR flag in the ordered extent. The error will
be the error that resulted in the transaction abort or -EIO if no
transaction abort happened.

This was recently reported by syzbot with the following trace:

   FAULT_INJECTION: forcing a failure.
   name failslab, interval 1, probability 0, space 0, times 1
   CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   Call Trace:
    
    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
    fail_dump lib/fault-inject.c:53 [inline]
    should_fail_ex+0x3b0/0x4e0 lib/fault-inject.c:154
    should_failslab+0xac/0x100 mm/failslab.c:46
    slab_pre_alloc_hook mm/slub.c:4072 [inline]
    slab_alloc_node mm/slub.c:4148 [inline]
    __do_kmalloc_node mm/slub.c:4297 [inline]
    __kmalloc_noprof+0xdd/0x4c0 mm/slub.c:4310
    kmalloc_noprof include/linux/slab.h:905 [inline]
    kzalloc_noprof include/linux/slab.h:1037 [inline]
    btrfs_chunk_alloc_add_chunk_item+0x244/0x1100 fs/btrfs/volumes.c:5742
    reserve_chunk_space+0x1ca/0x2c0 fs/btrfs/block-group.c:4292
    check_system_chunk fs/btrfs/block-group.c:4319 [inline]
    do_chunk_alloc fs/btrfs/block-group.c:3891 [inline]
    btrfs_chunk_alloc+0x77b/0xf80 fs/btrfs/block-group.c:4187
    find_free_extent_update_loop fs/btrfs/extent-tree.c:4166 [inline]
    find_free_extent+0x42d1/0x5810 fs/btrfs/extent-tree.c:4579
    btrfs_reserve_extent+0x422/0x810 fs/btrfs/extent-tree.c:4672
    btrfs_new_extent_direct fs/btrfs/direct-io.c:186 [inline]
    btrfs_get_blocks_direct_write+0x706/0xfa0 fs/btrfs/direct-io.c:321
    btrfs_dio_iomap_begin+0xbb7/0x1180 fs/btrfs/direct-io.c:525
    iomap_iter+0x697/0xf60 fs/iomap/iter.c:90
    __iomap_dio_rw+0xeb9/0x25b0 fs/iomap/direct-io.c:702
    btrfs_dio_write fs/btrfs/direct-io.c:775 [inline]
    btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880
    btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397
    do_iter_readv_writev+0x600/0x880
    vfs_writev+0x376/0xba0 fs/read_write.c:1050
    do_pwritev fs/read_write.c:1146 [inline]
    __do_sys_pwritev2 fs/read_write.c:1204 [inline]
    __se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f1281f85d29
   RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148
   RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29
   RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005
   RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003
   R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002
   R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328
    
   BTRFS error (device loop0 state A): Transaction aborted (error -12)
   BTRFS: error (device loop0 state A) in btrfs_chunk_alloc_add_chunk_item:5745: errno=-12 Out of memory
   BTRFS info (device loop0 state EA): forced readonly
   assertion failed: !(flags & ~BTRFS_ORDERED_TYPE_FLAGS), in fs/btrfs/ordered-data.c:1234
   ------------[ cut here ]------------
   kernel BUG at fs/btrfs/ordered-data.c:1234!
   Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
   CPU: 0 UID: 0 PID: 5321 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234
   RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246
   RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300
   RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
   RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4
   R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401
   R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000
   FS:  00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   Call Trace:
    
    btrfs_extract_ordered_extent fs/btrfs/direct-io.c:702 [inline]
    btrfs_dio_submit_io+0x4be/0x6d0 fs/btrfs/direct-io.c:737
    iomap_dio_submit_bio fs/iomap/direct-io.c:85 [inline]
    iomap_dio_bio_iter+0x1022/0x1740 fs/iomap/direct-io.c:447
    __iomap_dio_rw+0x13b7/0x25b0 fs/iomap/direct-io.c:703
    btrfs_dio_write fs/btrfs/direct-io.c:775 [inline]
    btrfs_direct_write+0x610/0xa30 fs/btrfs/direct-io.c:880
    btrfs_do_write_iter+0x2a0/0x760 fs/btrfs/file.c:1397
    do_iter_readv_writev+0x600/0x880
    vfs_writev+0x376/0xba0 fs/read_write.c:1050
    do_pwritev fs/read_write.c:1146 [inline]
    __do_sys_pwritev2 fs/read_write.c:1204 [inline]
    __se_sys_pwritev2+0x196/0x2b0 fs/read_write.c:1195
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f1281f85d29
   RSP: 002b:00007f12819fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000148
   RAX: ffffffffffffffda RBX: 00007f1282176080 RCX: 00007f1281f85d29
   RDX: 0000000000000001 RSI: 0000000020000240 RDI: 0000000000000005
   RBP: 00007f12819fe090 R08: 0000000000000000 R09: 0000000000000003
   R10: 0000000000007000 R11: 0000000000000246 R12: 0000000000000002
   R13: 0000000000000000 R14: 00007f1282176080 R15: 00007ffcb9e23328
    
   Modules linked in:
   ---[ end trace 0000000000000000 ]---
   RIP: 0010:btrfs_split_ordered_extent+0xd8d/0xe20 fs/btrfs/ordered-data.c:1234
   RSP: 0018:ffffc9000d1df2b8 EFLAGS: 00010246
   RAX: 0000000000000057 RBX: 000000000006a000 RCX: 9ce21886c4195300
   RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
   RBP: 0000000000000091 R08: ffffffff817f0a3c R09: 1ffff92001a3bdf4
   R10: dffffc0000000000 R11: fffff52001a3bdf5 R12: 1ffff1100a45f401
   R13: ffff8880522fa018 R14: dffffc0000000000 R15: 000000000006a000
   FS:  00007f12819fe6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 0000557750bd7da8 CR3: 00000000400ea000 CR4: 0000000000352ef0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

In this case the transaction abort was due to (an injected) memory
allocation failure when attempting to allocate a new chunk.

Reported-by: syzbot+f60d8337a5c8e8d92a77@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6777f2dd.050a0220.178762.0045.GAE@google.com/
Fixes: 52b1fdca23ac ("btrfs: handle completed ordered extents in btrfs_split_ordered_extent")
Reviewed-by: Qu Wenruo 
Signed-off-by: Filipe Manana 
Signed-off-by: David Sterba

Merge tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

2024-11-22T20:33:52+00:00

Pull cxl updates from Dave Jiang:

 - Constify range_contains() input parameters to prevent changes

 - Add support for displaying RCD capabilities in sysfs to support lspci
   for CXL device

 - Downgrade warning message to debug in cxl_probe_component_regs()

 - Add support for adding a printf specifier '%pra' to emit 'struct
   range' content:
     - Add sanity tests for 'struct resource'
     - Add documentation for special case
     - Add %pra for 'struct range'
     - Add %pra usage in CXL code

 - Add preparation code for DCD support:
     - Add range_overlaps()
     - Add CDAT DSMAS table shared and read only flag in ACPICA
     - Add documentation to 'struct dev_dax_range'
     - Delay event buffer allocation in CXL PCI code until needed
     - Use guard() in cxl_dpa_set_mode()
     - Refactor create region code to consolidate common code

* tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/region: Refactor common create region code
  cxl/hdm: Use guard() in cxl_dpa_set_mode()
  cxl/pci: Delay event buffer allocation
  dax: Document struct dev_dax_range
  ACPI/CDAT: Add CDAT/DSMAS shared and read only flag values
  range: Add range_overlaps()
  cxl/cdat: Use %pra for dpa range outputs
  printf: Add print format (%pra) for struct range
  Documentation/printf: struct resource add start == end special case
  test printf: Add very basic struct resource tests
  cxl: downgrade a warning message to debug level in cxl_probe_component_regs()
  cxl/pci: Add sysfs attribute for CXL 1.1 device link status
  cxl/core/regs: Add rcd_pcie_cap initialization
  kernel/range: Const-ify range_contains parameters

range: Add range_overlaps()

2024-11-08T16:36:29+00:00

Code to support CXL Dynamic Capacity devices will have extent ranges
which need to be compared for intersection not a subset as is being
checked in range_contains().

range_overlaps() is defined in btrfs with a different meaning from what
is required in the standard range code.  Dan Williams pointed this out
in [1].  Adjust the btrfs call according to his suggestion there.

Then add a generic range_overlaps().

Cc: Dan Williams 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: David Sterba 
Cc: linux-btrfs@vger.kernel.org
Link: https://lore.kernel.org/all/65949f79ef908_8dc68294f2@dwillia2-xfh.jf.intel.com.notmuch/ [1]
Acked-by: David Sterba 
Reviewed-by: Davidlohr Bueso 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Fan Ni 
Reviewed-by: Dave Jiang 
Reviewed-by: Jonathan Cameron 
Signed-off-by: Ira Weiny 
Link: https://patch.msgid.link/20241107-dcd-type2-upstream-v7-1-56a84e66bc36@intel.com
Signed-off-by: Dave Jiang