linux.git/fs/ext4/file.c, branch v6.16

ext4: Add multi-fsblock atomic write support with bigalloc

2025-05-20T14:31:12+00:00

EXT4 supports the bigalloc feature, which allows the FS to work in units of
clusters (groups of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc so that systems with bs = ps can
also create an FS using:
mkfs.ext4 -F -O bigalloc -b 4096 -C 16384

With bigalloc, ext4 can support multi-fsblock atomic writes. We will have to
adjust ext4's atomic write unit max value to the cluster size. This can then
support atomic writes of any size in the range [blocksize, clustersize]. This
patch adds the required changes to enable multi-fsblock atomic write
support using bigalloc in the next patch.
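
As a rough illustration of the [blocksize, clustersize] range, here is a
minimal userspace sketch, not ext4 code. The helper names is_pow2() and
atomic_write_len_ok() are hypothetical. The sketch assumes the usual
RWF_ATOMIC constraints: the length is a power of two and the offset is
naturally aligned to that length.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper for this sketch; not an ext4 function. */
static bool is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Return true if a write of @len bytes at offset @pos fits the atomic
 * write unit range [blocksize, clustersize]. Assumes blocksize and
 * clustersize are themselves powers of two.
 */
static bool atomic_write_len_ok(uint64_t pos, uint64_t len,
				uint32_t blocksize, uint32_t clustersize)
{
	if (!is_pow2(len))
		return false;
	if (len < blocksize || len > clustersize)
		return false;
	/* Natural alignment: the offset must be a multiple of the length. */
	return (pos & (len - 1)) == 0;
}
```

With -b 4096 -C 16384, this accepts lengths of 4096, 8192 and 16384 bytes
at suitably aligned offsets and rejects everything else.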

In this patch, for block allocation, we first query the underlying region
of the requested range by calling ext4_map_blocks(). We then handle the
following cases, depending on the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
then we don't call ext4_map_blocks() to allocate anything. We don't even
need to start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
extent into two unwritten extents of the required size.
5. If the underlying region has any type of mixed mapping, then we call
ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
within the requested range. This then provides a single mapped extent type
mapping for the requested range.
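
The five cases above can be sketched as a small decision function. This is
a userspace model under stated assumptions, not the ext4 implementation:
the enum names and pick_action() are hypothetical, the requested range is
modelled as an array of segment types as a query would report them, and
the order in which the cases are tested is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of what a mapping query reports for one segment. */
enum seg_type { SEG_HOLE, SEG_UNWRITTEN, SEG_MAPPED };

/* Hypothetical action names, one per case in the list above. */
enum alloc_action {
	ACT_NONE,		/* case 1: fully mapped, no allocation or txn */
	ACT_CREATE_MAPPED,	/* case 2: append write */
	ACT_CREATE_UNWRITTEN,	/* case 3: entire range is a hole */
	ACT_SPLIT_UNWRITTEN,	/* case 4: inside a large unwritten extent */
	ACT_ZERO_MIXED,		/* case 5: mixed mapping, zero out in a loop */
};

static enum alloc_action pick_action(const enum seg_type *segs, size_t nsegs,
				     bool is_append)
{
	size_t i;

	assert(nsegs > 0);

	/* Any change of type inside the range means a mixed mapping. */
	for (i = 1; i < nsegs; i++)
		if (segs[i] != segs[0])
			return ACT_ZERO_MIXED;

	switch (segs[0]) {
	case SEG_MAPPED:
		return ACT_NONE;
	case SEG_UNWRITTEN:
		return ACT_SPLIT_UNWRITTEN;
	case SEG_HOLE:
	default:
		return is_append ? ACT_CREATE_MAPPED : ACT_CREATE_UNWRITTEN;
	}
}
```

Only ACT_ZERO_MIXED corresponds to the loop with EXT4_GET_BLOCKS_ZERO; the
uniform cases need at most one allocation call.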

Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
flag only when the underlying extent mapping of the requested range is
not entirely a hole, an unwritten extent, or a fully mapped extent. That
is, if the underlying region contains a mix of hole(s), unwritten
extent(s), and mapped extent(s), we use this loop to ensure that all the
short mappings are zeroed out. This guarantees that the entire requested
range becomes a single, uniformly mapped extent. It is ok to do so
because we know this is being done on a bigalloc enabled filesystem
where the block bitmap represents the entire cluster unit.

Note that having a single contiguous underlying region of type mapped,
unwritten or hole is not a problem. The reason to avoid writing on top
of a mixed mapping region is that atomic writes require all-or-nothing
semantics for the userspace pwritev2 request. So if a crash or a sudden
poweroff occurs at any point during the write, the region undergoing
the atomic write should read either complete old data or complete new
data. It should never have a mix of both old and new data.
So, we first convert any mixed mapping region to a single contiguous
mapped extent before any data gets written to it. This is because
normally the FS will only convert unwritten extents to written at the
end of the write, in the ->end_io() call. If we allowed writes over a
mixed mapping and a sudden power off happened in between, we would end
up reading a mix of new data (over mapped extents) and old data (over
unwritten extents), because the unwritten to written conversion never
went through.
So to avoid this, and to avoid writes getting torn due to a mixed
mapping, we first allocate a single contiguous block mapping and then
do the write.

Acked-by: Darrick J. Wong
Co-developed-by: Ojaswin Mujoo
Signed-off-by: Ojaswin Mujoo
Signed-off-by: Ritesh Harjani (IBM)
Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o

ext4: factor out ext4_get_maxbytes()

2025-05-20T14:30:59+00:00

There are several locations that get the correct maxbytes value based on
the inode's block type. It would be beneficial to extract a common
helper function to make the code clearer.
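
A minimal sketch of what such a helper looks like, using simplified
stand-in types rather than the kernel's struct inode and struct
super_block. The names mock_sb_limits, mock_inode and mock_get_maxbytes()
are hypothetical, and the assumption here is that extent-mapped and
block-mapped inodes each have their own per-superblock limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for this sketch; not the real kernel layouts. */
struct mock_sb_limits {
	int64_t extent_maxbytes;	/* limit for extent-mapped files */
	int64_t bitmap_maxbytes;	/* limit for block-mapped files */
};

struct mock_inode {
	const struct mock_sb_limits *sb;
	bool uses_extents;
};

/* One place to pick the size limit that matches the inode's block type. */
static int64_t mock_get_maxbytes(const struct mock_inode *inode)
{
	if (inode->uses_extents)
		return inode->sb->extent_maxbytes;
	return inode->sb->bitmap_maxbytes;
}
```

Callers then compare an offset or length against mock_get_maxbytes(inode)
instead of repeating the block-type test at every site.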

Signed-off-by: Zhang Yi 
Reviewed-by: Jan Kara 
Reviewed-by: Baokun Li 
Link: https://patch.msgid.link/20250506012009.3896990-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o 
Cc: stable@kernel.org

Merge tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

2025-03-27T20:27:08+00:00

Pull ext4 updates from Ted Ts'o:
 "Ext4 bug fixes and cleanups, including:

   - hardening against maliciously fuzzed file systems

   - backwards compatibility for the brief period when we attempted to
     ignore zero-width characters

   - avoid potentially BUG'ing if there is a file system corruption
     found during the file system unmount

   - fix free space reporting by statfs when project quotas are enabled
     and the free space is less than the remaining project quota

  Also improve performance when replaying a journal with a very large
  number of revoke records (applicable for Lustre volumes)"

* tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (71 commits)
  ext4: fix OOB read when checking dotdot dir
  ext4: on a remount, only log the ro or r/w state when it has changed
  ext4: correct the error handle in ext4_fallocate()
  ext4: Make sb update interval tunable
  ext4: avoid journaling sb update on error if journal is destroying
  ext4: define ext4_journal_destroy wrapper
  ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)
  jbd2: add a missing data flush during file and fs synchronization
  ext4: don't over-report free space or inodes in statvfs
  ext4: clear DISCARD flag if device does not support discard
  jbd2: remove jbd2_journal_unfile_buffer()
  ext4: reorder capability check last
  ext4: update the comment about mb_optimize_scan
  jbd2: fix off-by-one while erasing journal
  ext4: remove references to bh->b_page
  ext4: goto right label 'out_mmap_sem' in ext4_setattr()
  ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()
  ext4: introduce ITAIL helper
  jbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature
  ext4: remove redundant function ext4_has_metadata_csum
  ...

Revert "ext4: add pre-content fsnotify hook for DAX faults"

2025-03-13T15:29:58+00:00

This reverts commit bb480760ffc7018e21ee6f60241c2b99ff26ee0e.

Signed-off-by: Amir Goldstein 
Signed-off-by: Jan Kara 
Link: https://patch.msgid.link/20250312073852.2123409-3-amir73il@gmail.com

ext4: add more ext4_emergency_state() checks around sb_rdonly()

2025-03-13T14:16:34+00:00

Some functions check sb_rdonly() to make sure the file system isn't
modified after it's read-only. Since we also don't want the file system
modified if it's in an emergency state (shutdown or emergency_ro),
we're adding additional ext4_emergency_state() checks where sb_rdonly()
is checked.

Suggested-by: Jan Kara 
Signed-off-by: Baokun Li 
Reviewed-by: Zhang Yi 
Reviewed-by: Jan Kara 
Link: https://patch.msgid.link/20250122114130.229709-5-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o

ext4: add ext4_emergency_state() helper function

2025-03-13T14:16:34+00:00

Since both SHUTDOWN and EMERGENCY_RO are emergency states of the ext4 file
system, and they are checked in similar locations, we have added a helper
function, ext4_emergency_state(), to determine whether the current file
system is in one of these two emergency states.

Then, replace calls to ext4_forced_shutdown() with ext4_emergency_state()
in those functions that could potentially trigger write operations.
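
The shape of such a helper can be sketched in userspace as follows. The
flag names, struct mock_sb_state and mock_emergency_state() are
hypothetical stand-ins, and the specific error codes returned for each
state are an assumption of this sketch.

```c
#include <errno.h>

/* Placeholder state bits for this sketch; not the kernel's flags. */
#define MOCK_FLAG_SHUTDOWN	(1u << 0)
#define MOCK_FLAG_EMERGENCY_RO	(1u << 1)

struct mock_sb_state {
	unsigned int flags;
};

/*
 * Return 0 if the file system is healthy, or a negative errno if it is
 * in either emergency state. Shutdown is checked first.
 */
static int mock_emergency_state(const struct mock_sb_state *sb)
{
	if (sb->flags & MOCK_FLAG_SHUTDOWN)
		return -EIO;	/* assumed error code for shutdown */
	if (sb->flags & MOCK_FLAG_EMERGENCY_RO)
		return -EROFS;	/* assumed error code for emergency_ro */
	return 0;
}
```

A write path can then bail out early with a single call instead of
testing each state separately.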

Signed-off-by: Baokun Li 
Reviewed-by: Jan Kara 
Reviewed-by: Zhang Yi 
Link: https://patch.msgid.link/20250122114130.229709-4-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o

ext4: add pre-content fsnotify hook for DAX faults

2024-12-11T16:28:41+00:00

ext4 has its own handling for DAX faults. Add the pre-content fsnotify
hook for this case.

Signed-off-by: Jan Kara

Merge tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

2024-11-19T00:32:58+00:00

Pull ext4 updates from Ted Ts'o:
 "A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most
  notably in the journaling code, buffered I/O, and compiler warning
  cleanups"

* tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits)
  jbd2: Fix comment describing journal_init_common()
  ext4: prevent an infinite loop in the lazyinit thread
  ext4: use struct_size() to improve ext4_htree_store_dirent()
  ext4: annotate struct fname with __counted_by()
  jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings
  ext4: use str_yes_no() helper function
  ext4: prevent delalloc to nodelalloc on remount
  jbd2: make b_frozen_data allocation always succeed
  ext4: cleanup variable name in ext4_fc_del()
  ext4: use string choices helpers
  jbd2: remove the 'success' parameter from the jbd2_do_replay() function
  jbd2: remove useless 'block_error' variable
  jbd2: factor out jbd2_do_replay()
  jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass()
  jbd2: unified release of buffer_head in do_one_pass()
  jbd2: remove redundant judgments for check v1 checksum
  ext4: use ERR_CAST to return an error-valued pointer
  mm: zero range of eof folio exposed by inode size extension
  ext4: partial zero eof block on unaligned inode size extension
  ext4: disambiguate the return value of ext4_dio_write_end_io()
  ...

ext4: disambiguate the return value of ext4_dio_write_end_io()

2024-11-13T04:54:14+00:00

The commit 91562895f803 ("ext4: properly sync file size update after O_SYNC
direct IO") causes confusion about the meaning of the return value of
ext4_dio_write_end_io().

Specifically, when the ext4_handle_inode_extension() operation succeeds,
ext4_dio_write_end_io() directly returns count instead of 0.

This does not cause a bug in the current kernel, but the semantics of the
return value of the ext4_dio_write_end_io() function are wrong, which is
likely to introduce bugs as the code evolves.
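
The intended semantics can be sketched as follows. This is a userspace
model, not the ext4 code: mock_handle_inode_extension() is a hypothetical
stand-in that returns the byte count on success or a negative errno, and
the end-io function collapses that into 0 on success.

```c
#include <errno.h>
#include <sys/types.h>

/*
 * Hypothetical stand-in for ext4_handle_inode_extension(): returns the
 * number of bytes handled on success, or -fail_errno (e.g. -ENOSPC) if
 * the caller asks it to fail.
 */
static ssize_t mock_handle_inode_extension(ssize_t count, int fail_errno)
{
	return fail_errno ? -fail_errno : count;
}

/*
 * End-io style callback: 0 on success, negative errno on failure.
 * A positive byte count is never passed back to the caller.
 */
static int mock_dio_write_end_io(ssize_t size, int error, int fail_errno)
{
	ssize_t ret;

	if (error)
		return error;		/* the I/O itself failed */
	if (size <= 0)
		return 0;		/* nothing was written */

	ret = mock_handle_inode_extension(size, fail_errno);
	return ret < 0 ? (int)ret : 0;	/* success is 0, not the count */
}
```

With this shape, mock_dio_write_end_io(4096, 0, 0) returns 0 rather than
4096, so a caller that treats any non-zero value as an error stays correct.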

Signed-off-by: Jinliang Zheng 
Reviewed-by: Zhang Yi 
Link: https://patch.msgid.link/20240919082539.381626-1-alexjlzheng@tencent.com
Signed-off-by: Theodore Ts'o

ext4: Do not fallback to buffered-io for DIO atomic write

2024-11-06T00:20:40+00:00

Atomic writes are currently supported only for a single fsblock and only
for direct-io. We should not return -ENOTBLK for atomic writes, since we
want the atomic write request to either complete fully or fail.
Hence, we should never fall back to buffered-io for DIO atomic write
requests.
Let's also catch this if it ever happens by adding a WARN_ON_ONCE before
the buffered-io handling for direct-io atomic writes. See the discussion
in [1] for more details.

While at it, let's add an inline helper ext4_want_directio_fallback() which
simplifies the logic checks and inherently fixes the condition for returning
-ENOTBLK in ext4_iomap_end(), which previously evaluated to true for any
write or any direct-io. That was ok since ext4 only supports direct-io via
iomap.
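
The fallback decision can be sketched in userspace as below. The MOCK_*
flag values are placeholders rather than the kernel's iomap flag
definitions, mock_want_directio_fallback() is a hypothetical name, and
the "nothing was written" condition is an assumption of this sketch.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Placeholder flag values for this sketch; not the kernel's definitions. */
#define MOCK_IOMAP_WRITE	(1u << 0)
#define MOCK_IOMAP_DIRECT	(1u << 1)
#define MOCK_IOMAP_ATOMIC	(1u << 2)

static bool mock_want_directio_fallback(unsigned int flags, ssize_t written)
{
	const unsigned int dio_write = MOCK_IOMAP_WRITE | MOCK_IOMAP_DIRECT;

	/* Must be both a write and direct-io, not just either one. */
	if ((flags & dio_write) != dio_write)
		return false;

	/* Atomic writes are all-or-nothing: never fall back. */
	if (flags & MOCK_IOMAP_ATOMIC)
		return false;

	/* Assumption: retry via buffered-io only if nothing was written. */
	return written == 0;
}
```

Testing the masked flags for equality, rather than for any bit being set,
is what separates "write and direct-io" from "write or direct-io".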

[1]: https://lore.kernel.org/linux-xfs/cover.1729825985.git.ritesh.list@gmail.com/T/#m9dbecc11bed713ed0d7a486432c56b105b555f04
Suggested-by: Darrick J. Wong  # inline helper
Signed-off-by: Ritesh Harjani (IBM) 
Reviewed-by: Darrick J. Wong 
Signed-off-by: Darrick J. Wong 
Reviewed-by: Jan Kara