linux.git/fs/ext4/inode.c, branch v6.13

Merge tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

2024-11-19T00:32:58+00:00

Pull ext4 updates from Ted Ts'o:
 "A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most
  notably in the journaling code, bufered I/O, and compiler warning
  cleanups"

* tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits)
  jbd2: Fix comment describing journal_init_common()
  ext4: prevent an infinite loop in the lazyinit thread
  ext4: use struct_size() to improve ext4_htree_store_dirent()
  ext4: annotate struct fname with __counted_by()
  jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings
  ext4: use str_yes_no() helper function
  ext4: prevent delalloc to nodelalloc on remount
  jbd2: make b_frozen_data allocation always succeed
  ext4: cleanup variable name in ext4_fc_del()
  ext4: use string choices helpers
  jbd2: remove the 'success' parameter from the jbd2_do_replay() function
  jbd2: remove useless 'block_error' variable
  jbd2: factor out jbd2_do_replay()
  jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass()
  jbd2: unified release of buffer_head in do_one_pass()
  jbd2: remove redundant judgments for check v1 checksum
  ext4: use ERR_CAST to return an error-valued pointer
  mm: zero range of eof folio exposed by inode size extension
  ext4: partial zero eof block on unaligned inode size extension
  ext4: disambiguate the return value of ext4_dio_write_end_io()
  ...

ext4: partial zero eof block on unaligned inode size extension

2024-11-13T04:54:14+00:00

Using mapped writes, it's technically possible to expose stale
post-eof data on a truncate up operation. Consider the following
example:

$ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \
	-c "truncate 8k" -c "pread -v 2k 16" 
...
00000800:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
...

This shows that the post-eof data written via mwrite lands within
EOF after a truncate up. While this is deliberate of the test case,
behavior is somewhat unpredictable because writeback does post-eof
zeroing, and writeback can occur at any time in the background. For
example, an fsync inserted between the mwrite and truncate causes
the subsequent read to instead return zeroes. This basically means
that there is a race window in this situation between any subsequent
extending operation and writeback that dictates whether post-eof
data is exposed to the file or zeroed.

To prevent this problem, perform partial block zeroing as part of
the various inode size extending operations that are susceptible to
it. For truncate extension, zero around the original eof similar to
how truncate down does partial zeroing of the new eof. For extension
via writes and fallocate related operations, zero the newly exposed
range of the file to cover any partial zeroing that must occur at
the original and new eof blocks.

Signed-off-by: Brian Foster 
Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com
Signed-off-by: Theodore Ts'o

ext4: fix race in buffer_head read fault injection

2024-11-13T04:54:14+00:00

When I enabled ext4 debug for fault injection testing, I encountered the
following warning:

  EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress:
         Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051
  WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0

The root cause of the issue lies in the improper implementation of ext4's
buffer_head read fault injection. The actual completion of buffer_head
read and the buffer_head fault injection are not atomic, which can lead
to the uptodate flag being cleared on normally used buffer_heads in race
conditions.

[CPU0]           [CPU1]         [CPU2]
ext4_read_inode_bitmap
  ext4_read_bh()
  
                 ext4_read_inode_bitmap
                   if (buffer_uptodate(bh))
                     return bh
                               jbd2_journal_commit_transaction
                                 __jbd2_journal_refile_buffer
                                   __jbd2_journal_unfile_buffer
                                     __jbd2_journal_temp_unlink_buffer
  ext4_simulate_fail_bh()
    clear_buffer_uptodate
                                      mark_buffer_dirty
                                        
                                        WARN_ON_ONCE(!buffer_uptodate(bh))

The best approach would be to perform fault injection in the IO completion
callback function, rather than after IO completion. However, the IO
completion callback function cannot get the fault injection code in sb.

Fix it by passing the result of fault injection into the bh read function,
we simulate faults within the bh read function itself. This requires adding
an extra parameter to the bh read functions that need fault injection.

Fixes: 46f870d690fe ("ext4: simulate various I/O and checksum errors when reading metadata")
Signed-off-by: Long Li 
Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com
Signed-off-by: Theodore Ts'o

ext4: don't pass full mapping flags to ext4_es_insert_extent()

2024-11-13T04:54:14+00:00

When converting a delalloc extent in ext4_es_insert_extent(), since we
only want to pass the info of whether the quota has already been claimed
if the allocation is a direct allocation from ext4_map_create_blocks(),
there is no need to pass full mapping flags, so changes to just pass
whether the EXT4_GET_BLOCKS_DELALLOC_RESERVE bit is set.

Suggested-by: Jan Kara 
Signed-off-by: Zhang Yi 
Reviewed-by: Jan Kara 
Link: https://patch.msgid.link/20240906061401.2980330-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o

fs: ext4: Don't use CMA for buffer_head

2024-11-13T04:54:13+00:00

cma_alloc() keep failed in our system which thanks to a jh->bh->b_page
can not be migrated out of CMA area[1] as the jh has one cp_transaction
pending on it because of j_free > j_max_transaction_buffers[2][3][4][5][6].
We temporarily solve this by launching jbd2_log_do_checkpoint forcefully
somewhere. Since journal is common mechanism to all JFSs and
cp_transaction has a little fewer opportunity to be launched, the
cma_alloc() could be affected under the same scenario. This patch
would like to have buffer_head of ext4 not use CMA pages when doing
sb_getblk.

[1]
crash_arm64_v8.0.4++> kmem -p|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping)
fffffffe01a51c00  e9470000 ffffff808f0aa150        3  2 8000000008020 lru,private
fffffffe03d189c0 174627000 ffffff808f0aa150        4  2 2004000000008020 lru,private
fffffffe03d88e00 176238000 ffffff808f0aa150      3f9  2 2008000000008020 lru,private
fffffffe03d88e40 176239000 ffffff808f0aa150        6  2 2008000000008020 lru,private
fffffffe03d88e80 17623a000 ffffff808f0aa150        5  2 2008000000008020 lru,private
fffffffe03d88ec0 17623b000 ffffff808f0aa150        1  2 2008000000008020 lru,private
fffffffe03d88f00 17623c000 ffffff808f0aa150        0  2 2008000000008020 lru,private
fffffffe040e6540 183995000 ffffff808f0aa150      3f4  2 2004000000008020 lru,private

[2] page -> buffer_head
crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x
      private = 0xffffff802fca0c00

[3] buffer_head -> journal_head
crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00
  b_private = 0xffffff8041338e10,

[4] journal_head -> b_cp_transaction
crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x
  b_cp_transaction = 0xffffff80410f1900,

[5] transaction_t -> journal
crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x
  t_journal = 0xffffff80e70f3000,

[6] j_free & j_max_transaction_buffers
crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
  j_free = 0x3f1,
  j_max_transaction_buffers = 0x100,

Suggested-by: Theodore Ts'o 
Signed-off-by: Zhaoyang Huang 
Reviewed-by: Jan Kara 
Link: https://patch.msgid.link/20240904075300.1148836-1-zhaoyang.huang@unisoc.com
Signed-off-by: Theodore Ts'o

ext4: Do not fallback to buffered-io for DIO atomic write

2024-11-06T00:20:40+00:00

atomic writes is currently only supported for single fsblock and only
for direct-io. We should not return -ENOTBLK for atomic writes since we
want the atomic write request to either complete fully or fail
otherwise. Hence, we should never fallback to buffered-io in case of
DIO atomic write requests.
Let's also catch if this ever happens by adding some WARN_ON_ONCE before
buffered-io handling for direct-io atomic writes. More details of the
discussion [1].

While at it let's add an inline helper ext4_want_directio_fallback() which
simplifies the logic checks and inherently fixes condition on when to return
-ENOTBLK which otherwise was always returning true for any write or directio in
ext4_iomap_end(). It was ok since ext4 only supports direct-io via iomap.

[1]: https://lore.kernel.org/linux-xfs/cover.1729825985.git.ritesh.list@gmail.com/T/#m9dbecc11bed713ed0d7a486432c56b105b555f04
Suggested-by: Darrick J. Wong  # inline helper
Signed-off-by: Ritesh Harjani (IBM) 
Reviewed-by: Darrick J. Wong 
Signed-off-by: Darrick J. Wong 
Reviewed-by: Jan Kara

ext4: Add statx support for atomic writes

2024-11-06T00:20:40+00:00

This patch adds base support for atomic writes via statx getattr.
On bs < ps systems, we can create FS with say bs of 16k. That means
both atomic write min and max unit can be set to 16k for supporting
atomic writes.

Co-developed-by: Ojaswin Mujoo 
Signed-off-by: Ojaswin Mujoo 
Signed-off-by: Ritesh Harjani (IBM) 
Reviewed-by: Darrick J. Wong 
Signed-off-by: Darrick J. Wong 
Reviewed-by: Jan Kara

Merge tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

2024-09-21T02:26:45+00:00

Pull ext4 updates from Ted Ts'o:
 "Lots of cleanups and bug fixes this cycle, primarily in the block
  allocation, extent management, fast commit, and journalling"

* tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (93 commits)
  ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C
  ext4: check stripe size compatibility on remount as well
  ext4: fix i_data_sem unlock order in ext4_ind_migrate()
  ext4: remove the special buffer dirty handling in do_journal_get_write_access
  ext4: fix a potential assertion failure due to improperly dirtied buffer
  ext4: hoist ext4_block_write_begin and replace the __block_write_begin
  ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers
  ext4: dax: keep orphan list before truncate overflow allocated blocks
  ext4: fix error message when rejecting the default hash
  ext4: save unnecessary indentation in ext4_ext_create_new_leaf()
  ext4: make some fast commit functions reuse extents path
  ext4: refactor ext4_swap_extents() to reuse extents path
  ext4: get rid of ppath in convert_initialized_extent()
  ext4: get rid of ppath in ext4_ext_handle_unwritten_extents()
  ext4: get rid of ppath in ext4_ext_convert_to_initialized()
  ext4: get rid of ppath in ext4_convert_unwritten_extents_endio()
  ext4: get rid of ppath in ext4_split_convert_extents()
  ext4: get rid of ppath in ext4_split_extent()
  ext4: get rid of ppath in ext4_force_split_extent_at()
  ext4: get rid of ppath in ext4_split_extent_at()
  ...

ext4: remove the special buffer dirty handling in do_journal_get_write_access

2024-09-04T02:14:17+00:00

This kinda revert the commit 56d35a4cd13e("ext4: Fix dirtying of
journalled buffers in data=journal mode") made by Jan 14 years ago,
since the do_get_write_access() itself can deal with the extra
unexpected buf dirting things in a proper way now.

Suggested-by: Jan Kara 
Signed-off-by: Shida Zhang 
Reviewed-by: Jan Kara 
Link: https://patch.msgid.link/20240830053739.3588573-5-zhangshida@kylinos.cn
Signed-off-by: Theodore Ts'o

ext4: fix a potential assertion failure due to improperly dirtied buffer

2024-09-04T02:14:17+00:00

On an old kernel version(4.19, ext3, data=journal, pagesize=64k),
an assertion failure will occasionally be triggered by the line below:
-----------
jbd2_journal_commit_transaction
{
...
J_ASSERT_BH(bh, !buffer_dirty(bh));
/*
* The buffer on BJ_Forget list and not jbddirty means
...
}
-----------

The same condition may also be applied to the lattest kernel version.

When blocksize < pagesize and we truncate a file, there can be buffers in
the mapping tail page beyond i_size. These buffers will be filed to
transaction's BJ_Forget list by ext4_journalled_invalidatepage() during
truncation. When the transaction doing truncate starts committing, we can
grow the file again. This calls __block_write_begin() which allocates new
blocks under these buffers in the tail page we go through the branch:

                        if (buffer_new(bh)) {
                                clean_bdev_bh_alias(bh);
                                if (folio_test_uptodate(folio)) {
                                        clear_buffer_new(bh);
                                        set_buffer_uptodate(bh);
                                        mark_buffer_dirty(bh);
                                        continue;
                                }
                                ...
                        }

Hence buffers on BJ_Forget list of the committing transaction get marked
dirty and this triggers the jbd2 assertion.

Teach ext4_block_write_begin() to properly handle files with data
journalling by avoiding dirtying them directly. Instead of
folio_zero_new_buffers() we use ext4_journalled_zero_new_buffers() which
takes care of handling journalling. We also don't need to mark new uptodate
buffers as dirty in ext4_block_write_begin(). That will be either done
either by block_commit_write() in case of success or by
folio_zero_new_buffers() in case of failure.

Reported-by: Baolin Liu 
Suggested-by: Jan Kara 
Signed-off-by: Shida Zhang 
Reviewed-by: Jan Kara 
Link: https://patch.msgid.link/20240830053739.3588573-4-zhangshida@kylinos.cn
Signed-off-by: Theodore Ts'o