path: root/fs/ext4
Age / Commit message / Author
2026-02-02ext4: move ->read_folio and ->readahead to readpage.cChristoph Hellwig
Keep all the read into pagecache code in a single file. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20260202060754.270269-4-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29fsverity: start consolidating pagecache codeChristoph Hellwig
ext4 and f2fs are largely using the same code to read a page full of Merkle tree blocks from the page cache, and the upcoming xfs fsverity support would add another copy. Move the ext4 code to fs/verity/ and use it in f2fs as well. For f2fs this removes the previous f2fs-specific error injection, but otherwise the behavior remains unchanged. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260128152630.627409-7-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29fsverity: pass struct file to ->write_merkle_tree_blockChristoph Hellwig
This will make an iomap implementation of the method easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260128152630.627409-6-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29ext4: don't build the fsverity work handler for !CONFIG_FS_VERITYChristoph Hellwig
Use IS_ENABLED to disable this code, leading to a slight size reduction:

   text    data     bss     dec     hex filename
   4121     376      16    4513    11a1 fs/ext4/readpage.o.old
   4030     328      16    4374    1116 fs/ext4/readpage.o

Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260128152630.627409-4-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
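The idea of compiling a handler out behind a constant condition can be sketched in userspace C. This is only an illustration: `SKETCH_FS_VERITY`, `verity_work()` and `finish_read()` are hypothetical names, and the 0/1 macro is a simplified stand-in for the kernel's `IS_ENABLED(CONFIG_FS_VERITY)`, not its real implementation.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * SKETCH_FS_VERITY is a hypothetical 0/1 switch standing in for
 * IS_ENABLED(CONFIG_FS_VERITY); build with -DSKETCH_FS_VERITY=1 to enable.
 */
#ifndef SKETCH_FS_VERITY
#define SKETCH_FS_VERITY 0
#endif

static int verity_work_calls;   /* how often the handler actually ran */

/* Stand-in for the verity work handler. */
static void verity_work(void)
{
    verity_work_calls++;
}

/*
 * Read-completion step. With the switch at 0 the condition is a
 * compile-time constant, so the compiler can drop the call and, with it,
 * the otherwise unreferenced handler.
 */
static void finish_read(bool needs_verity)
{
    if (SKETCH_FS_VERITY && needs_verity)
        verity_work();
}
```

Unlike an `#ifdef` block, the disabled branch is still parsed and type-checked, which is the usual reason to prefer this form.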
2026-01-29fs,fsverity: clear out fsverity_info from common codeChristoph Hellwig
Free the fsverity_info directly in clear_inode instead of requiring file systems to handle it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # btrfs Link: https://lore.kernel.org/r/20260128152630.627409-3-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-29fs,fsverity: reject size changes on fsverity files in setattr_prepareChristoph Hellwig
Add the check to reject truncates of fsverity files directly to setattr_prepare instead of requiring the file system to handle it. Besides removing boilerplate code, this also fixes the complete lack of such check in btrfs. Fixes: 146054090b08 ("btrfs: initial fsverity support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/r/20260128152630.627409-2-hch@lst.de Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2026-01-23ext4: allow zeroout when doing written to unwritten splitOjaswin Mujoo
Currently, when we are doing an extent split and convert operation of a written to unwritten extent (for example, as done by ZERO_RANGE), we don't allow the zeroout fallback in case the extent tree manipulation fails. This is mostly because zeroout might take unusually long and because this code path is more tolerant to failures than endio. Since we have the zeroout machinery in place, we might as well use it, hence lift this restriction. To mitigate zeroout taking too long, respect the max zeroout limit here so that the operation finishes relatively fast. Also, add kunit tests for this case. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/1c3349020b8e098a63f293b84bc8a9b56011cef4.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23ext4: refactor split and convert extentsOjaswin Mujoo
ext4_split_convert_extents() has been historically prone to subtle bugs and inconsistent behavior due to the way all the various flags interact with the extent split and conversion process. For example, callers like ext4_convert_unwritten_extents_endio() and convert_initialized_extents() needed to open code extent conversion despite passing CONVERT or CONVERT_UNWRITTEN flags because ext4_split_convert_extents() wasn't performing the conversion.

Hence, refactor ext4_split_convert_extents() to clearly enforce the semantics of each flag. The major changes here are:

* Clearly separate the split and convert process:
  * ext4_split_extent() and ext4_split_extent_at() are now only responsible for performing the split.
  * ext4_split_convert_extents() is now responsible for performing extent conversion after calling ext4_split_extent() for splitting.
  * This helps get rid of all the MARK_UNWRIT* flags.
* Clearly enforce the semantics of flags passed to ext4_split_convert_extents():
  * EXT4_GET_BLOCKS_CONVERT: Will convert the split extent to written
  * EXT4_GET_BLOCKS_CONVERT_UNWRITTEN: Will convert the split extent to unwritten
* Modify all callers to enforce the above semantics.
* Use ext4_split_convert_extents() instead of ext4_split_extents() in ext4_ext_convert_to_initialized() for uniformity.
* Now that ext4_split_convert_extents() is handling caching to es, we don't need to do it in ext4_split_extent_zeroout().
* Cleanup all callers open coding the conversion logic.

Further, modify kunit tests to pass flags based on the new semantics. From an end user point of view, we should not see any changes in behavior of ext4.

Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/2084a383d69ceefbaa293b8fcf725365eca0a349.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23ext4: refactor zeroout path and handle all casesOjaswin Mujoo
Currently, zeroout is used as a fallback in case we fail to split/convert extents in the "traditional" modify-the-extent-tree way. This is essential to mitigate failures in critical paths like extent splitting during endio. However, the logic is very messy and not easy to follow. Further, the fragile use of various flags has made it prone to errors. Refactor the zeroout logic by moving it up to ext4_split_extents(). Further, zeroout correctly based on the type of conversion we want, i.e.:
- unwritten to written: Zeroout everything around the mapped range.
- written to unwritten: Zeroout only the mapped range.
Also, ext4_ext_convert_to_initialized() now passes EXT4_GET_BLOCKS_CONVERT to make the intention clear. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/e1b51dedeca7c0b1f702141d91edfe4230560e7b.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
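The two zeroout rules can be modelled on a plain byte buffer. This is a minimal userspace sketch, not the kernel code: `enum convert_kind` and `zeroout_for_conversion()` are hypothetical names, and the extent is represented as an in-memory array rather than on-disk blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical conversion kinds, named for this sketch only. */
enum convert_kind {
    UNWRITTEN_TO_WRITTEN,
    WRITTEN_TO_UNWRITTEN,
};

/*
 * Zero the part of an extent selected by the conversion kind.
 *   ext, ext_len      bytes backing the whole extent
 *   map_off, map_len  the mapped sub-range inside the extent
 */
static void zeroout_for_conversion(unsigned char *ext, size_t ext_len,
                                   size_t map_off, size_t map_len,
                                   enum convert_kind kind)
{
    assert(map_off <= ext_len && map_len <= ext_len - map_off);

    if (kind == UNWRITTEN_TO_WRITTEN) {
        /* Keep the mapped data, zero everything around it. */
        memset(ext, 0, map_off);
        memset(ext + map_off + map_len, 0, ext_len - map_off - map_len);
    } else {
        /* Only the mapped range has to read back as zeroes. */
        memset(ext + map_off, 0, map_len);
    }
}
```

In both cases the whole extent can then be treated as written without changing what a reader observes, which is what makes zeroout usable as a fallback when the split fails.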
2026-01-23ext4: propagate flags to ext4_convert_unwritten_extents_endio()Ojaswin Mujoo
Currently, callers like ext4_convert_unwritten_extents() pass EXT4_EX_NOCACHE flag to avoid caching extents however this is not respected by ext4_convert_unwritten_extents_endio(). Hence, modify it to accept flags from the caller and to pass the flags on to other extent manipulation functions it calls. This makes sure the NOCACHE flag is respected throughout the code path. Also, since the caller already passes METADATA_NOFAIL and CONVERT flags we don't need to explicitly pass it anymore. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/7c2139e0ad32c49c19b194f72219e15d613de284.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23ext4: propagate flags to convert_initialized_extent()Ojaswin Mujoo
Currently, ext4_zero_range passes EXT4_EX_NOCACHE flag to avoid caching extents however this is not respected by convert_initialized_extent(). Hence, modify it to accept flags from the caller and to pass the flags on to other extent manipulation functions it calls. This makes sure the NOCACHE flag is respected throughout the code path. Also, we no longer explicitly pass CONVERT_UNWRITTEN as the caller takes care of this. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/07008fbb14db727fddcaf4c30e2346c49f6c8fe0.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23ext4: add extent status cache support to kunit testsOjaswin Mujoo
Add support in Kunit tests to ensure that the extent status cache is also in sync after the extent split and conversion operations. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/5f9d2668feeb89a3f3e9d03dadab8c10cbea3741.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23ext4: kunit tests for higher level extent manipulation functionsOjaswin Mujoo
Add more kunit tests to cover the high level caller ext4_map_create_blocks(). We pass flags in a manner that covers the below functions:

1. ext4_ext_handle_unwritten_extents()
   1.1 - Split/Convert unwritten extent to written in endio context.
   1.2 - Split/Convert unwritten extent to written in non endio context.
   1.3 - Zeroout tests for the above 2 cases
2. convert_initialized_extent() - Convert written extent to unwritten during zero range

Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/9d8ad32cb62f44999c0fe3545b44fc3113546c70.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23ext4: kunit tests for extent splitting and conversionOjaswin Mujoo
Add multiple KUnit tests to test various permutations of extent splitting and conversion. We test the following cases:

1. Split of unwritten extent into 2 parts and convert 1 part to written
2. Split of unwritten extent into 3 parts and convert 1 part to written
3. Split of written extent into 2 parts and convert 1 part to unwritten
4. Split of written extent into 3 parts and convert 1 part to unwritten
5. Zeroout fallback for all the above cases except 3-4 because zeroout is not supported for written to unwritten splits

The main function we test here is ext4_split_convert_extents(). Currently some of the tests are failing due to issues in implementation. All failures are mitigated at other layers in ext4 [1] but still point out the mismatch in expectation of what the caller wants vs what the function does. The aim is to eventually fix all the failures we see here. More detailed implementation notes can be found in the topmost commit in the test file.

[1] for example, EXT4_GET_BLOCKS_CONVERT doesn't really convert the split extent to written, but rather the callers end up doing the conversion.

Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/22bb9d17cd88c1318a2edde48887ca7488cb8a13.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-20mm/block/fs: remove laptop_modeJohannes Weiner
Laptop mode was introduced to save battery, by delaying and consolidating writes and thereby maximize the time rotating hard drives wouldn't have to spin. Luckily, rotating hard drives, with their high spin-up times and power draw, are a thing of the past for battery-powered devices. Reclaim has also since changed to not write single filesystem pages anymore, and regular filesystem writeback is lumpy by design. The juice doesn't appear worth the squeeze anymore. The footprint of the feature is small, but nevertheless it's a complicating factor in mm, block, filesystems. Developers don't think about it, and it likely hasn't been tested with new reclaim and writeback changes in years. Let's sunset it. Keep the sysctl with a deprecation warning around for a few more cycles, but remove all functionality behind it. [akpm@linux-foundation.org: fix Documentation/admin-guide/laptops/index.rst] Link: https://lkml.kernel.org/r/20251216185201.GH905277@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-19ext4: use optimized mballoc scanning regardless of inode formatJan Kara
Currently we don't use mballoc optimized scanning (using max free extent order and avg free extent order group lists) for inodes with indirect block based format. This is confusing for users and I don't see a good reason for that. Even with indirect block based inode format we can spend a large amount of time searching for free blocks on large filesystems with fragmented free space. To add to the confusion, before commit 077d0c2c78df ("ext4: make mb_optimize_scan performance mount option work with extents") optimized scanning was applied *only* to indirect block based inodes, so that commit appears as a performance regression to some users. Just use optimized scanning whenever it is enabled by mount options. Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Link: https://patch.msgid.link/20260114182836.14120-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: always allocate blocks only from groups inode can useJan Kara
For filesystems with more than 2^32 blocks, inodes using the indirect block based format cannot use blocks beyond the 32-bit limit. ext4_mb_scan_groups_linear() takes care not to select these unsupported groups for such inodes; however, other functions selecting groups for allocation don't. So far this is harmless because the other selection functions are used only with mb_optimize_scan, which is currently disabled for inodes with indirect blocks; however, in the following patch we want to enable mb_optimize_scan regardless of inode format. Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: stable@kernel.org Link: https://patch.msgid.link/20260114182836.14120-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
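The 32-bit restriction amounts to a per-group eligibility check. The following is a rough userspace sketch under stated assumptions: `group_usable()` and its parameters are hypothetical, groups are assumed to be laid out contiguously from `first_data_block`, and the kernel's actual bookkeeping may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: may an inode of this format allocate from `group`?
 * Indirect-block inodes store 32-bit block numbers, so every block of the
 * group has to be below 2^32. Extent-based inodes have no such limit here.
 */
static bool group_usable(uint64_t group, uint64_t blocks_per_group,
                         uint64_t first_data_block, bool uses_extents)
{
    uint64_t last_block;

    if (uses_extents)
        return true;

    /* Last block number that belongs to this group. */
    last_block = first_data_block + (group + 1) * blocks_per_group - 1;
    return last_block <= UINT32_MAX;
}
```

With 32768 blocks per group, for example, the first 131072 groups pass the check for an indirect-block inode and every later group is rejected.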
2026-01-19ext4: fix dirtyclusters double decrement on fs shutdownBrian Foster
fstests test generic/388 occasionally reproduces a warning in ext4_put_super() associated with the dirty clusters count: WARNING: CPU: 7 PID: 76064 at fs/ext4/super.c:1324 ext4_put_super+0x48c/0x590 [ext4] Tracing the failure shows that the warning fires due to an s_dirtyclusters_counter value of -1. IOW, this appears to be a spurious decrement as opposed to some sort of leak. Further tracing of the dirty cluster count deltas and an LLM scan of the resulting output identified the cause as a double decrement in the error path between ext4_mb_mark_diskspace_used() and the caller ext4_mb_new_blocks(). First, note that generic/388 is a shutdown vs. fsstress test and so produces a random set of operations and shutdown injections. In the problematic case, the shutdown triggers an error return from the ext4_handle_dirty_metadata() call(s) made from ext4_mb_mark_context(). The changed value is non-zero at this point, so ext4_mb_mark_diskspace_used() does not exit after the error bubbles up from ext4_mb_mark_context(). Instead, the former decrements both cluster counters and returns the error up to ext4_mb_new_blocks(). The latter falls into the !ar->len out path which decrements the dirty clusters counter a second time, creating the inconsistency. To avoid this problem and simplify ownership of the cluster reservation in this codepath, lift the counter reduction to a single place in the caller. This makes it more clear that ext4_mb_new_blocks() is responsible for acquiring cluster reservation (via ext4_claim_free_clusters()) in the !delalloc case as well as releasing it, regardless of whether it ends up consumed or returned due to failure. Fixes: 0087d9fb3f29 ("ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20260113171905.118284-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
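The ownership rule described above (the caller both claims and releases the reservation, in one place) can be sketched as follows. This is a simplified userspace model: `struct sketch_counters`, `mark_diskspace_used()` and `new_blocks()` are hypothetical stand-ins, and the error injection flag replaces the shutdown-triggered failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dirty clusters counter. */
struct sketch_counters {
    long dirty_clusters;
};

/*
 * Callee: marks the space as used. In this model it never touches the
 * reservation, so it cannot decrement it a second time on error.
 */
static int mark_diskspace_used(bool inject_error)
{
    return inject_error ? -EIO : 0;
}

/* Caller: acquires the reservation and releases it exactly once. */
static int new_blocks(struct sketch_counters *c, long len, bool inject_error)
{
    int err;

    c->dirty_clusters += len;           /* claim the reservation */
    err = mark_diskspace_used(inject_error);
    c->dirty_clusters -= len;           /* single release point */
    return err;
}
```

Because only one function adjusts the counter, a failure anywhere below it leaves the count balanced rather than at -1.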
2026-01-19ext4: fast commit: make s_fc_lock reclaim-safeLi Chen
s_fc_lock can be acquired from inode eviction and thus is reclaim unsafe. Since the fast commit path holds s_fc_lock while writing the commit log, allocations under the lock can enter reclaim and invert the lock order with fs_reclaim. Add ext4_fc_lock()/ext4_fc_unlock() helpers which acquire s_fc_lock under memalloc_nofs_save()/restore() context and use them everywhere so allocations under the lock cannot recurse into filesystem reclaim. Fixes: 6593714d67ba ("ext4: hold s_fc_lock while during fast commit") Signed-off-by: Li Chen <me@linux.beauty> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260106120621.440126-1-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: fix e4b bitmap inconsistency reportsYongjian Sun
A bitmap inconsistency issue was observed during stress tests under mixed huge-page workloads. Ext4 reported multiple e4b bitmap check failures like: ext4_mb_complex_scan_group:2508: group 350, 8179 free clusters as per group info. But got 8192 blocks Analysis and experimentation confirmed that the issue is caused by a race condition between page migration and bitmap modification. Although this timing window is extremely narrow, it is still hit in practice: folio_lock ext4_mb_load_buddy __migrate_folio check ref count folio_mc_copy __filemap_get_folio folio_try_get(folio) ...... mb_mark_used ext4_mb_unload_buddy __folio_migrate_mapping folio_ref_freeze folio_unlock The root cause of this issue is that the fast path of load_buddy only increments the folio's reference count, which is insufficient to prevent concurrent folio migration. We observed that the folio migration process acquires the folio lock. Therefore, we can determine whether to take the fast path in load_buddy by checking the lock status. If the folio is locked, we opt for the slow path (which acquires the lock) to close this concurrency window. Additionally, this change addresses the following issues: When the DOUBLE_CHECK macro is enabled to inspect bitmap-related issues, the following error may be triggered: corruption in group 324 at byte 784(6272): f in copy != ff on disk/prealloc Analysis reveals that this is a false positive. There is a specific race window where the bitmap and the group descriptor become momentarily inconsistent, leading to this error report: ext4_mb_load_buddy ext4_mb_load_buddy __filemap_get_folio(create|lock) folio_lock ext4_mb_init_cache folio_mark_uptodate __filemap_get_folio(no lock) ...... mb_mark_used mb_mark_used_double mb_cmp_bitmaps mb_set_bits(e4b->bd_bitmap) folio_unlock The original logic assumed that since mb_cmp_bitmaps is called when the bitmap is newly loaded from disk, the folio lock would be sufficient to prevent concurrent access. 
However, this overlooks a specific race condition: if another process attempts to load buddy and finds the folio is already in an uptodate state, it will immediately begin using it without holding folio lock. Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260106090820.836242-1-sunyongjian@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
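The fast-path decision described in this commit can be reduced to a small predicate. This is a sketch only: `struct sketch_folio`, `enum load_path` and `choose_load_path()` are hypothetical, and the real code operates on kernel folios with proper locking rather than plain booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified view of a cached folio. */
struct sketch_folio {
    bool uptodate;  /* contents already initialised */
    bool locked;    /* someone (e.g. migration) holds the folio lock */
};

enum load_path { LOAD_FAST, LOAD_SLOW };

/*
 * Take the lockless fast path only for an up-to-date folio that nobody
 * has locked. A locked folio may be in the middle of migration, so fall
 * back to the slow path, which acquires the lock itself.
 */
static enum load_path choose_load_path(const struct sketch_folio *folio)
{
    if (folio == NULL || !folio->uptodate)
        return LOAD_SLOW;
    if (folio->locked)
        return LOAD_SLOW;
    return LOAD_FAST;
}
```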
2026-01-19ext4: remove redundant NULL check after __GFP_NOFAILBaolin Liu
Remove redundant NULL check after kcalloc() with GFP_NOFS | __GFP_NOFAIL. Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20260106062016.154573-1-liubaolin12138@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: remove EXT4_GET_BLOCKS_IO_CREATE_EXTZhang Yi
We do not use EXT4_GET_BLOCKS_IO_CREATE_EXT or split extents before submitting I/O; therefore, remove the related code. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: simplify the mapping query logic in ext4_iomap_begin()Zhang Yi
In the write path mapping check of ext4_iomap_begin(), the return value 'ret' should never be greater than orig_mlen. If 'ret' equals 'orig_mlen', it can be returned directly without checking IOMAP_ATOMIC. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: remove unused unwritten parameter in ext4_dio_write_iter()Zhang Yi
The parameter unwritten in ext4_dio_write_iter() is no longer needed, simply remove it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: remove useless ext4_iomap_overwrite_opsZhang Yi
ext4_iomap_overwrite_ops was introduced in commit 8cd115bdda17 ("ext4: Optimize ext4 DIO overwrites"), which can optimize pure overwrite performance by dropping the IOMAP_WRITE flag to only query the mapped mapping information. This avoids starting a new journal handle, thereby improving speed. Later, commit 9faac62d4013 ("ext4: optimize file overwrites") also optimized similar scenarios, but it performs the check later, examining the mappings status only when the actual block mapping is needed. Thus, it can handle the previous commit scenario. That means in the case of an overwrite scenario, the condition "offset + length <= i_size_read(inode)" in the write path must always be true. Therefore, it is acceptable to remove the ext4_iomap_overwrite_ops, which will also clarify the write and read paths of ext4_iomap_begin. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: avoid starting handle when dio writing an unwritten extentZhang Yi
Since we have deferred the split of the unwritten extent until after I/O completion, it is not necessary to initiate the journal handle when submitting the I/O. This can improve the write performance of concurrent DIO for multiple files. The fio tests below show a ~25% performance improvement when writing to unwritten files on my VM with a mem disk.

  [unwritten]
  direct=1
  ioengine=psync
  numjobs=16
  rw=write # write/randwrite
  bs=4K
  iodepth=1
  directory=/mnt
  size=5G
  runtime=30s
  overwrite=0
  norandommap=1
  fallocate=native
  ramp_time=5s
  group_reporting=1

  [w/o]
  w: IOPS=62.5k, BW=244MiB/s
  rw: IOPS=56.7k, BW=221MiB/s

  [w]
  w: IOPS=79.6k, BW=311MiB/s
  rw: IOPS=70.2k, BW=274MiB/s

Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
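The "~25%" figure can be checked against the IOPS numbers quoted in the commit message. The helper below is a plain arithmetic sketch; `improvement_pct()` is a name made up for this example.

```c
#include <assert.h>

/* Relative improvement of `after` over `before`, in percent. */
static double improvement_pct(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}
```

Feeding in the quoted values gives about 27.4% for sequential writes (62.5k to 79.6k IOPS) and about 23.8% for random writes (56.7k to 70.2k IOPS), which brackets the stated ~25%.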
2026-01-19ext4: don't split extent before submitting I/OZhang Yi
Currently, when writing back dirty pages to the filesystem with the dioread_nolock feature enabled and when doing DIO, if the area to be written back is part of an unwritten extent, the EXT4_GET_BLOCKS_IO_CREATE_EXT flag is set during block allocation before submitting I/O. The function ext4_split_convert_extents() then attempts to split this extent in advance. This approach is designed to prevent extent splitting and conversion to the written type from failing due to insufficient disk space at the time of I/O completion, which could otherwise result in data loss. However, we already have two mechanisms to ensure successful extent conversion. The first is the EXT4_GET_BLOCKS_METADATA_NOFAIL flag, which is a best effort: it permits the use of 2% of the reserved space or 4,096 blocks in the file system when splitting extents. This flag covers most scenarios where extent splitting might fail. The second is the EXT4_EXT_MAY_ZEROOUT flag, which is also set during extent splitting. If the reserved space is insufficient and splitting fails, it does not retry the allocation. Instead, it directly zeros out the extra part of the extent, thereby avoiding splitting and directly converting the entire extent to the written type. These two mechanisms also exist when I/Os are completed because there is a concurrency window between write-back and fallocate, which may still require us to split extents upon I/O completion. There is not much difference from splitting extents before submitting I/O. Therefore, it seems possible to defer the splitting until I/O completion; it won't increase the risk of I/O failure and data loss. On the contrary, if some I/Os can be merged at I/O completion, it can also reduce unnecessary splitting operations, thereby alleviating the pressure on reserved space. In addition, deferring extent splitting until I/O completion can also simplify the I/O submission process and avoid initiating unnecessary journal handles when writing unwritten extents.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: use reserved metadata blocks when splitting extent on endioZhang Yi
When performing buffered writes, we may need to split and convert an unwritten extent into a written one during the end I/O process. However, we do not reserve space specifically for these metadata changes, we only reserve 2% of space or 4096 blocks. To address this, we use EXT4_GET_BLOCKS_PRE_IO to potentially split extents in advance and EXT4_GET_BLOCKS_METADATA_NOFAIL to utilize reserved space if necessary. These two approaches can reduce the likelihood of running out of space and losing data. However, these methods are merely best efforts, we could still run out of space, and there is not much difference between converting an extent during the writeback process and the end I/O process, it won't increase the risk of losing data if we postpone the conversion. Therefore, also use EXT4_GET_BLOCKS_METADATA_NOFAIL in ext4_convert_unwritten_extents_endio() to prepare for the buffered I/O iomap conversion, which may perform extent conversion during the end I/O process. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20260105014522.1937690-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
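The "2% of space or 4096 blocks" reservation can be written out as a one-line calculation. This sketch assumes the reservation is the smaller of the two values and is counted in clusters; `reserved_clusters()` is a hypothetical helper, not a function from the source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the metadata reservation, assuming it is capped at the
 * smaller of 2% of the filesystem and 4096 clusters.
 */
static uint64_t reserved_clusters(uint64_t total_clusters)
{
    uint64_t two_percent = total_clusters / 50;

    return two_percent < 4096 ? two_percent : 4096;
}
```

Under that assumption the cap of 4096 is reached once the filesystem has 204800 clusters, so on anything but very small filesystems the reservation is a fixed, fairly small pool, which is why the text calls it a best effort.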
2026-01-19ext4: fix memory leak in ext4_ext_shift_extents()Zilin Guan
In ext4_ext_shift_extents(), if the extent is NULL in the while loop, the function returns immediately without releasing the path obtained via ext4_find_extent(), leading to a memory leak. Fix this by jumping to the out label to ensure the path is properly released. Fixes: a18ed359bdddc ("ext4: always check ext4_ext_find_extent result") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20251225084800.905701-1-zilin@seu.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
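The leak and its fix follow the usual single-exit cleanup pattern in C. The sketch below is a userspace model: `struct sketch_path`, `find_extent()`, `free_path()` and `shift_extents()` are hypothetical stand-ins, `live_paths` exists only to make the leak observable, and `-EIO` is used in place of whatever error code the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_paths;  /* outstanding path allocations, for the demo */

struct sketch_path { int depth; };

/* Stand-in for the lookup that allocates a path. */
static struct sketch_path *find_extent(void)
{
    struct sketch_path *p = calloc(1, sizeof(*p));

    if (p)
        live_paths++;
    return p;
}

static void free_path(struct sketch_path *p)
{
    if (p) {
        free(p);
        live_paths--;
    }
}

/* null_extent_at: iteration at which the extent is "missing", -1 for never. */
static int shift_extents(int iterations, int null_extent_at)
{
    struct sketch_path *path = NULL;
    int ret = 0;

    for (int i = 0; i < iterations; i++) {
        free_path(path);            /* drop the previous lookup result */
        path = find_extent();
        if (!path) {
            ret = -ENOMEM;
            goto out;
        }
        if (i == null_extent_at) {
            /* A bare `return` here would leak `path`. */
            ret = -EIO;
            goto out;
        }
    }
out:
    free_path(path);
    return ret;
}
```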
2026-01-19ext4: don't order data when zeroing unwritten or delayed blockZhang Yi
When zeroing out a written partial block, it is necessary to order the data to prevent exposing stale data on disk. However, if the buffer is unwritten or delayed, it is not allocated as written, so ordering the data is not required. This can prevent strange and unnecessary ordered writes when appending data across a region within a block. Assume we have a 2K unwritten file on a filesystem with 4K blocksize, and a buffered write from 3K to 4K. Before this patch, __ext4_block_zero_page_range() would add the range [2K,3K) to the ordered range, and then the JBD2 commit process would write back this block. However, it does nothing since the block is not mapped as written; the folio will be redirtied and written back again through the normal write back process. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20251223011927.34042-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
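The range in the example (a 2K file, 4K blocks, a write starting at 3K) can be derived mechanically. This is only a model of that arithmetic: `struct byte_range` and `zero_gap()` are hypothetical and do not correspond to a kernel function.

```c
#include <assert.h>
#include <stdint.h>

struct byte_range {
    uint64_t start;
    uint64_t end;   /* half-open: [start, end) */
};

/*
 * Bytes between the old end of file and the start of a new write that
 * fall inside the block containing the old end of file. An empty range
 * (start == end) means nothing needs zeroing in that block.
 */
static struct byte_range zero_gap(uint64_t old_size, uint64_t write_pos,
                                  uint64_t blocksize)
{
    struct byte_range r = { old_size, old_size };
    uint64_t block_end;

    assert(blocksize > 0);
    if (write_pos <= old_size || old_size % blocksize == 0)
        return r;   /* no gap, or old EOF is block aligned */

    block_end = (old_size / blocksize + 1) * blocksize;
    r.end = write_pos < block_end ? write_pos : block_end;
    return r;
}
```

For `zero_gap(2048, 3072, 4096)` the result is the range 2048 to 3072, i.e. the [2K,3K) region that previously ended up on the ordered list.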
2026-01-19ext4: remove unnecessary zero-initialization via memsetpengdonglin
The d_path function does not require the caller to pre-zero the buffer. Signed-off-by: pengdonglin <pengdonglin@xiaomi.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://patch.msgid.link/20251211123829.2777009-1-dolinux.peng@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: mark group extend fast-commit ineligibleLi Chen
Fast commits only log operations that have dedicated replay support. EXT4_IOC_GROUP_EXTEND grows the filesystem to the end of the last block group and updates the same on-disk metadata without going through the fast commit tracking paths. In practice these operations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible when EXT4_IOC_GROUP_EXTEND grows the filesystem. This forces those transactions to fall back to a full commit, ensuring that the group extension changes are captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes online resize via GROUP_EXTEND safer and easier to reason about under fast commit.

Testing:

1. prepare:
    dd if=/dev/zero of=/root/fc_resize.img bs=1M count=0 seek=256
    mkfs.ext4 -O fast_commit -F /root/fc_resize.img
    mkdir -p /mnt/fc_resize && mount -t ext4 -o loop /root/fc_resize.img /mnt/fc_resize
2. Extended the filesystem to the end of the last block group using a helper that calls EXT4_IOC_GROUP_EXTEND on the mounted filesystem and checked fc_info:
    ./group_extend_helper /mnt/fc_resize
    cat /proc/fs/ext4/loop0/fc_info
   shows the "Resize" ineligible reason increased.
3. Fsynced a file on the resized filesystem and confirmed that the fast commit ineligible counter incremented for the resize transaction:
    touch /mnt/fc_resize/file
    /root/fsync_file /mnt/fc_resize/file
    sync
    cat /proc/fs/ext4/loop0/fc_info

Signed-off-by: Li Chen <me@linux.beauty> Link: https://patch.msgid.link/20251211115146.897420-6-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: mark group add fast-commit ineligibleLi Chen
Fast commits only log operations that have dedicated replay support. Online resize via EXT4_IOC_GROUP_ADD updates the superblock and group descriptor metadata without going through the fast commit tracking paths. In practice these operations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible when ext4_ioctl_group_add() adds new block groups. This forces those transactions to fall back to a full commit, ensuring that the filesystem geometry updates are captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes online resize via GROUP_ADD safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc_resize.img bs=1M count=0 seek=256 mkfs.ext4 -O fast_commit -F /root/fc_resize.img mkdir -p /mnt/fc_resize && mount -t ext4 -o loop /root/fc_resize.img /mnt/fc_resize 2. Ran a helper that issues EXT4_IOC_GROUP_ADD on the mounted filesystem and checked the resize ineligible reason: ./group_add_helper /mnt/fc_resize cat /proc/fs/ext4/loop0/fc_info shows "Resize": > 0. 3. Fsynced a file on the resized filesystem and verified that the fast commit stats report at least one ineligible commit: touch /mnt/fc_resize/file /root/fsync_file /mnt/fc_resize/file sync cat /proc/fs/ext4/loop0/fc_info shows fc stats ineligible > 0. Signed-off-by: Li Chen <me@linux.beauty> Link: https://patch.msgid.link/20251211115146.897420-5-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: mark move extents fast-commit ineligibleLi Chen
Fast commits only log operations that have dedicated replay support. EXT4_IOC_MOVE_EXT swaps extents between regular files and may copy data, rewriting the affected inodes' block mapping layout without going through the fast commit tracking paths. In practice these operations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible for the journal transactions used by move_extent_per_page() when EXT4_IOC_MOVE_EXT runs. This forces those transactions to fall back to a full commit, ensuring that these multi-inode extent swaps are captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes online defragmentation safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc_move.img bs=1M count=0 seek=256 mkfs.ext4 -O fast_commit -F /root/fc_move.img mkdir -p /mnt/fc_move && mount -t ext4 -o loop \ /root/fc_move.img /mnt/fc_move 2. Created two files, ran EXT4_IOC_MOVE_EXT via e4defrag, and checked the ineligible reason statistics: fallocate -l 64M /mnt/fc_move/file1 cp /mnt/fc_move/file1 /mnt/fc_move/file2 e4defrag /mnt/fc_move/file1 cat /proc/fs/ext4/loop0/fc_info shows "Move extents": > 0 and fc stats ineligible > 0. Signed-off-by: Li Chen <me@linux.beauty> Link: https://patch.msgid.link/20251211115146.897420-4-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19ext4: mark fs-verity enable fast-commit ineligibleLi Chen
Fast commits only log operations that have dedicated replay support. Enabling fs-verity builds a Merkle tree and updates inode and orphan state in ways that are not described by the fast commit replay tags. In practice these operations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible when ext4_end_enable_verity() starts its journal transaction. This forces that transaction to fall back to a full commit, ensuring that the fs-verity enable changes are captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes fs-verity enable safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc_verity.img bs=1M count=0 seek=128 mkfs.ext4 -O fast_commit,verity -F /root/fc_verity.img mkdir -p /mnt/fc_verity && mount -t ext4 -o loop /root/fc_verity.img /mnt/fc_verity 2. Enabled fs-verity on a file and verified reason accounting: echo "data" > /mnt/fc_verity/verityfile /root/enable_verity /mnt/fc_verity/verityfile sync tail -n 1 /proc/fs/ext4/loop0/fc_info "fs-verity enable": 1 3. Enabled fs-verity on a second file, fsynced it, and checked that the ineligible commit counter is updated too: echo "data2" > /mnt/fc_verity/verityfile2 /root/enable_verity /mnt/fc_verity/verityfile2 /root/fsync_file /mnt/fc_verity/verityfile2 sync /proc/fs/ext4/loop0/fc_info shows "fs-verity enable" incremented and fc stats ineligible increased accordingly. Signed-off-by: Li Chen <me@linux.beauty> Link: https://patch.msgid.link/20251211115146.897420-3-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu>
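The /root/enable_verity helper used in the test steps is likewise not shown. A rough sketch of such a helper, assuming it issues FS_IOC_ENABLE_VERITY with SHA-256, no salt and no signature, is given below. The layout of `struct fsverity_enable_arg` follows <linux/fsverity.h>; the function names are hypothetical, and the ioctl number is computed with the generic `_IOW()` encoding, which does not hold on every architecture.

```python
import fcntl
import os
import struct

# Generic Linux _IOW() encoding (x86-64, arm64 and most other architectures;
# a few, such as powerpc and mips, use different direction bits).
_IOC_WRITE = 1


def _iow(type_char: str, nr: int, size: int) -> int:
    return (_IOC_WRITE << 30) | (size << 16) | (ord(type_char) << 8) | nr


# struct fsverity_enable_arg from <linux/fsverity.h>:
#   __u32 version, hash_algorithm, block_size, salt_size; __u64 salt_ptr;
#   __u32 sig_size, __reserved1; __u64 sig_ptr; __u64 __reserved2[11];
ENABLE_ARG = struct.Struct("=IIIIQIIQ11Q")
FS_IOC_ENABLE_VERITY = _iow("f", 133, ENABLE_ARG.size)
FS_VERITY_HASH_ALG_SHA256 = 1


def build_enable_arg(block_size: int = 4096) -> bytes:
    """Pack an argument requesting SHA-256 with no salt and no signature."""
    return ENABLE_ARG.pack(1, FS_VERITY_HASH_ALG_SHA256, block_size,
                           0, 0, 0, 0, 0, *([0] * 11))


def enable_verity(path: str) -> None:
    """Enable fs-verity on `path`; the ioctl requires a read-only descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FS_IOC_ENABLE_VERITY, build_enable_arg())
    finally:
        os.close(fd)
```

On a filesystem created with `-O verity`, `enable_verity("/mnt/fc_verity/verityfile")` would play the role of the helper in step 2. The ioctl fails with an error on filesystems without the verity feature.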
2026-01-19ext4: mark inode format migration fast-commit ineligibleLi Chen
Fast commits only log operations that have dedicated replay support. Inode format migration (indirect<->extent layout changes via EXT4_IOC_MIGRATE or toggling EXT4_EXTENTS_FL) rewrites the block mapping representation without going through the fast commit tracking paths. In practice these migrations are rare and usually followed by further updates, but mixing them into a fast commit makes the overall semantics harder to reason about and risks replay gaps if new call sites appear. Teach ext4 to mark the filesystem fast-commit ineligible when ext4_ext_migrate() or ext4_ind_migrate() start their journal transactions. This forces those transactions to fall back to a full commit, ensuring that the entire inode layout change is captured by the normal journal rather than partially encoded in fast commit TLVs. This change should not affect common workloads but makes format migrations safer and easier to reason about under fast commit. Testing: 1. prepare: dd if=/dev/zero of=/root/fc.img bs=1M count=0 seek=128 mkfs.ext4 -O fast_commit -F /root/fc.img mkdir -p /mnt/fc && mount -t ext4 -o loop /root/fc.img /mnt/fc 2. Created a test file and toggled the extents flag to exercise both ext4_ind_migrate() and ext4_ext_migrate(): touch /mnt/fc/migtest chattr -e /mnt/fc/migtest chattr +e /mnt/fc/migtest 3. Verified fast-commit ineligible statistics: tail -n 1 /proc/fs/ext4/loop0/fc_info "Inode format migration": 2 Signed-off-by: Li Chen <me@linux.beauty> Link: https://patch.msgid.link/20251211115146.897420-2-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu>
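The checks in these fast-commit test steps all read per-reason counters from /proc/fs/ext4/loop0/fc_info. A small sketch for pulling those counters out is shown below. It assumes only the line shape quoted in the commit messages (`"Inode format migration": 2`) and ignores every other line; the function name is hypothetical.

```python
import re

# Matches lines such as:  "Inode format migration": 2
_REASON_LINE = re.compile(r'^\s*"(?P<reason>[^"]+)":\s*(?P<count>\d+)\s*$')


def parse_ineligible_reasons(text: str) -> dict[str, int]:
    """Return {reason: count} for every quoted-reason line found in `text`."""
    reasons = {}
    for line in text.splitlines():
        m = _REASON_LINE.match(line)
        if m:
            reasons[m.group("reason")] = int(m.group("count"))
    return reasons
```

For example, `parse_ineligible_reasons(open("/proc/fs/ext4/loop0/fc_info").read())` could be compared before and after running `chattr -e` and `chattr +e` to confirm that the migration counter moved.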
2026-01-19ext4: add sysfs attribute err_report_sec to control s_err_report timerBaolin Liu
Add a new sysfs attribute "err_report_sec" to control the s_err_report timer in ext4_sb_info. Writing '0' disables the timer, while writing a non-zero value enables the timer and sets the timeout in seconds. Signed-off-by: Baolin Liu <liubaolin@kylinos.cn> Link: https://patch.msgid.link/20251211030256.28613-1-liubaolin12138@163.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
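A short sketch of driving the new attribute from a script follows. It assumes the attribute appears under /sys/fs/ext4/<device>/ alongside the other per-filesystem ext4 attributes; the function name and the `sysfs_root` parameter are hypothetical and exist only to make the example testable.

```python
from pathlib import Path


def set_err_report_sec(device: str, seconds: int,
                       sysfs_root: str = "/sys/fs/ext4") -> Path:
    """Write `seconds` to err_report_sec; 0 disables the s_err_report timer."""
    if seconds < 0:
        raise ValueError("seconds must be >= 0")
    attr = Path(sysfs_root) / device / "err_report_sec"
    attr.write_text(f"{seconds}\n")
    return attr
```

`set_err_report_sec("loop0", 0)` would disable the timer for the filesystem on loop0, and a non-zero value would re-arm it with that timeout in seconds.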
2026-01-19ext4: move ext4_percpu_param_init() before ext4_mb_init()Baokun Li
When running `kvm-xfstests -c ext4/1k -C 1 generic/383` with the `DOUBLE_CHECK` macro defined, the following panic is triggered:

==================================================================
EXT4-fs error (device vdc): ext4_validate_block_bitmap:423: comm mount: bg 0: bad block bitmap checksum
BUG: unable to handle page fault for address: ff110000fa2cc000
PGD 3e01067 P4D 3e02067 PUD 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 2386 Comm: mount Tainted: G W 6.18.0-gba65a4e7120a-dirty #1152 PREEMPT(none)
RIP: 0010:percpu_counter_add_batch+0x13/0xa0
Call Trace:
 <TASK>
 ext4_mark_group_bitmap_corrupted+0xcb/0xe0
 ext4_validate_block_bitmap+0x2a1/0x2f0
 ext4_read_block_bitmap+0x33/0x50
 mb_group_bb_bitmap_alloc+0x33/0x80
 ext4_mb_add_groupinfo+0x190/0x250
 ext4_mb_init_backend+0x87/0x290
 ext4_mb_init+0x456/0x640
 __ext4_fill_super+0x1072/0x1680
 ext4_fill_super+0xd3/0x280
 get_tree_bdev_flags+0x132/0x1d0
 vfs_get_tree+0x29/0xd0
 vfs_cmd_create+0x59/0xe0
 __do_sys_fsconfig+0x4f6/0x6b0
 do_syscall_64+0x50/0x1f0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================================================

This issue can be reproduced using the following commands:

    mkfs.ext4 -F -q -b 1024 /dev/sda 5G
    tune2fs -O quota,project /dev/sda
    mount /dev/sda /tmp/test

With DOUBLE_CHECK defined, mb_group_bb_bitmap_alloc() reads and validates the block bitmap. When the validation fails, ext4_mark_group_bitmap_corrupted() attempts to update sbi->s_freeclusters_counter. However, this percpu_counter has not been initialized yet at this point, which leads to the panic described above.

Fix this by moving the execution of ext4_percpu_param_init() to occur before ext4_mb_init(), ensuring the per-CPU counters are initialized before they are used.
Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20251209133116.731350-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: drop the TODO comment in ext4_es_insert_extent()Zhang Yi
Now we have ext4_es_cache_extent() to cache on-disk extents instead of ext4_es_insert_extent(), so drop the TODO comment. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-15-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: replace ext4_es_insert_extent() when caching on-disk extentsZhang Yi
In ext4, the remaining places for inserting extents into the extent status tree within ext4_ext_determine_insert_hole() and ext4_map_query_blocks() directly cache on-disk extents. We can use ext4_es_cache_extent() instead of ext4_es_insert_extent() in these cases. This will help reduce unnecessary increases in extent sequence numbers and cache invalidations after supporting IOMAP in the future. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-14-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: adjust the debug info in ext4_es_cache_extent()Zhang Yi
Print a trace point after successfully inserting an extent in the ext4_es_cache_extent() function. Additionally, similar to other extent cache operation functions, call ext4_print_pending_tree() to display the extent debug information of the inode when in ES_DEBUG mode. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-13-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: make ext4_es_cache_extent() support overwrite existing extentsZhang Yi
Currently, ext4_es_cache_extent() is used to load extents into the extent status tree when reading on-disk extent blocks. But it inserts information into the extent status tree if and only if there isn't information about the specified range already. So it is only used for the initial loading and does not support overwriting extents. However, there are many other places in ext4 where on-disk extents are inserted into the extent status tree, such as in ext4_map_query_blocks(). Currently, they call ext4_es_insert_extent() to perform the insertion, but they don't modify the extents, so ext4_es_cache_extent() would be a more appropriate choice. However, when ext4_map_query_blocks() inserts an extent, it may overwrite a short existing extent of the same type. Therefore, to prepare for the replacements, we need to extend ext4_es_cache_extent() to allow it to overwrite existing extents with the same status. So it checks the found extents before removing and inserting. (There is one exception: a hole in the on-disk extent but a delayed extent in the extent status tree is allowed.) In addition, since cached extents can be more lenient than the extents they modify and do not involve modifying reserved blocks, it is not necessary to ensure that the insertion operation succeeds as strictly as in the ext4_es_insert_extent() function. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-12-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
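The overwrite rule described above can be illustrated with a toy per-block status map. This is only a sketch of the rule as stated in the commit message, not the kernel implementation: the dictionary representation, the function name, refusing to cache on a conflict, and keeping the delayed entry in the hole case are all assumptions made for the example.

```python
HOLE, WRITTEN, UNWRITTEN, DELAYED = "H", "W", "U", "D"


def cache_extent(tree: dict[int, str], start: int, length: int, status: str) -> bool:
    """Cache an on-disk extent in a toy per-block status map.

    Returns True if the range was cached, False if it conflicted with
    existing entries and the map was left unchanged.
    """
    blocks = range(start, start + length)
    for blk in blocks:
        old = tree.get(blk)
        if old is None or old == status:
            continue  # no information yet, or same status: safe to overwrite
        if status == HOLE and old == DELAYED:
            continue  # the one allowed mismatch: on-disk hole vs. delayed entry
        return False  # conflicting status: refuse to cache (assumed behavior)
    for blk in blocks:
        if status == HOLE and tree.get(blk) == DELAYED:
            continue  # keep the delayed entry rather than replace it (assumed)
        tree[blk] = status
    return True
```

Extending a short written entry succeeds, for instance `cache_extent({0: "W", 1: "W"}, 0, 4, "W")`, while caching an unwritten extent over those written blocks is rejected and leaves the map untouched.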
2026-01-18ext4: make __es_remove_extent() check extent statusZhang Yi
Currently, __es_remove_extent() unconditionally removes extent status entries within the specified range. In order to prepare for extending the ext4_es_cache_extent() function to cache on-disk extents, which may overwrite some existing short-length extents with the same status, allow __es_remove_extent() to check the specified extent type before removing it, and to return an error and pass out the conflicting extent if the status does not match. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-11-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: cleanup useless out label in __es_remove_extent()Zhang Yi
The out label in __es_remove_extent() just returns the err value, so we can return it directly if something bad happens. Therefore, remove the useless out label and rename out_get_reserved to out. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-10-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: cleanup zeroout in ext4_split_extent_at()Zhang Yi
zero_ex is a temporary variable used only for writing zeros and inserting an extent status entry; it will not be directly inserted into the tree. Therefore, it can be assigned values from the target extent in various scenarios, eliminating the need to explicitly assign values to each variable individually. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-9-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: drop extent cache when splitting extent failsZhang Yi
When the split extent fails, we might leave some extents still being processed and return an error directly, which will result in stale extent entries remaining in the extent status tree. So drop all of the remaining potentially stale extents if the splitting fails. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251129103247.686136-8-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: drop extent cache after doing PARTIAL_VALID1 zerooutZhang Yi
When splitting an unwritten extent in the middle and converting it to initialized in ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT and EXT4_EXT_DATA_VALID2 flags set, it could leave a stale unwritten extent.

Assume we have an unwritten file and do a buffered write in the middle of it without dioread_nolock enabled; it will allocate blocks as a written extent.

     0 A       B N
    [UUUUUUUUUUUU] on-disk extent        U: unwritten extent
    [UUUUUUUUUUUU] extent status tree
    [--DDDDDDDD--]                       D: valid data
       |<-  ->| ----> this range needs to be initialized

ext4_split_extent() first tries to split this extent at B with the EXT4_EXT_DATA_PARTIAL_VALID1 and EXT4_EXT_MAY_ZEROOUT flags set, but ext4_split_extent_at() fails to split this extent due to a temporary lack of space. It zeroes out B to N and leaves the entire extent as unwritten.

     0 A       B N
    [UUUUUUUUUUUU] on-disk extent
    [UUUUUUUUUUUU] extent status tree
    [--DDDDDDDDZZ]                       Z: zeroed data

ext4_split_extent() then tries to split this extent at A with the EXT4_EXT_DATA_VALID2 flag set. This time, it splits successfully and leaves a written extent from A to N.

     0 A       B N
    [UUWWWWWWWWWW] on-disk extent        W: written extent
    [UUUUUUUUUUUU] extent status tree
    [--DDDDDDDDZZ]

Finally, ext4_map_create_blocks() only inserts the extent from A to B into the extent status tree, leaving a stale unwritten extent in the status tree.

     0 A       B N
    [UUWWWWWWWWWW] on-disk extent        W: written extent
    [UUWWWWWWWWUU] extent status tree
    [--DDDDDDDDZZ]

Fix this issue by always dropping the cached extent status entry after zeroing out the second part.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251129103247.686136-7-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
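The sequence in this commit message can be replayed with two per-block status strings, one for the on-disk extent and one for the extent status tree. The sketch below is a toy model of the steps as described, not kernel code; the function names and the list-of-characters representation are assumptions for illustration.

```python
def split_scenario(n: int = 12, a: int = 2, b: int = 10) -> tuple[str, str]:
    """Replay the described steps; return (on-disk, status-tree) per-block states."""
    on_disk = ["U"] * n  # on-disk extent, initially all unwritten
    es_tree = ["U"] * n  # extent status tree, initially all unwritten

    # Step 1: the split at B fails for lack of space. Blocks B..N are zeroed,
    # but the extent stays unwritten on disk and in the status tree.

    # Step 2: the split at A succeeds and leaves a written extent from A to N
    # on disk. The status tree is not updated by this step.
    on_disk[a:n] = ["W"] * (n - a)

    # Step 3: only the requested range A..B is inserted into the status tree.
    es_tree[a:b] = ["W"] * (b - a)

    return "".join(on_disk), "".join(es_tree)


def stale_blocks(on_disk: str, es_tree: str) -> list[int]:
    """Block indexes where the status tree disagrees with the on-disk extent."""
    return [i for i, (d, e) in enumerate(zip(on_disk, es_tree)) if d != e]
```

With the default arguments, `stale_blocks(*split_scenario())` reports the last two blocks, the B..N tail that was zeroed in the first step and is still recorded as unwritten in the status tree.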
2026-01-18ext4: don't cache extent during splitting extentZhang Yi
Caching extents during the splitting process is risky, as it may result in stale extents remaining in the status tree. Moreover, in most cases, the corresponding extent block entries are likely already cached before the split happens, making caching here not particularly useful.

Assume we have an unwritten extent, and then DIO writes the first half.

    [UUUUUUUUUUUUUUUU] on-disk extent     U: unwritten extent
    [UUUUUUUUUUUUUUUU] extent status tree
     |<- ->| ----> dio write this range

First, when ext4_split_extent_at() splits this extent, it truncates the existing extent and then inserts a new one. During this process, this extent status entry may be shrunk, and calls to ext4_find_extent() and ext4_cache_extents() may occur, which could potentially insert the truncated range as a hole into the extent status tree. After the split is completed, this hole is not replaced with the correct status.

    [UUUUUUU|UUUUUUUU] on-disk extent     U: unwritten extent
    [UUUUUUU|HHHHHHHH] extent status tree H: hole

Then, the outer calling functions will not correct this remaining hole extent either. Finally, if we perform a delayed buffer write on this latter part, it will re-insert the delayed extent and cause an error in space accounting.

In addition, if the unwritten extent cache is not shrunk during the splitting, ext4_cache_extents() also conflicts with existing extents when caching extents. In the future, we will add checks when caching extents, which will trigger a warning. Therefore, do not cache extents that are being split.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-6-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: correct the mapping status if the extent has been zeroedZhang Yi
Before submitting I/O and allocating blocks with the EXT4_GET_BLOCKS_PRE_IO flag set, ext4_split_convert_extents() may convert the target extent range to initialized due to ENOSPC, ENOMEM, or EQUOTA errors. However, it still incorrectly marks the mapping as unwritten. Although this may not seem to cause any practical problems, it will result in an unnecessary extent conversion operation after I/O completion. Therefore, it's better to correct the returned mapping status. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251129103247.686136-5-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-18ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/OZhang Yi
When allocating blocks during within-EOF DIO and writeback with dioread_nolock enabled, EXT4_GET_BLOCKS_PRE_IO was set to split an existing large unwritten extent. However, EXT4_GET_BLOCKS_CONVERT was set when calling ext4_split_convert_extents(), which may potentially result in stale data issues.

Assume we have an unwritten extent, and then DIO writes the second half.

    [UUUUUUUUUUUUUUUU] on-disk extent     U: unwritten extent
    [UUUUUUUUUUUUUUUU] extent status tree
              |<- ->| ----> dio write this range

First, ext4_iomap_alloc() calls ext4_map_blocks() with the EXT4_GET_BLOCKS_PRE_IO, EXT4_GET_BLOCKS_UNWRIT_EXT and EXT4_GET_BLOCKS_CREATE flags set. ext4_map_blocks() finds this extent and calls ext4_split_convert_extents() with EXT4_GET_BLOCKS_CONVERT and the above flags set.

Then, ext4_split_convert_extents() calls ext4_split_extent() with the EXT4_EXT_MAY_ZEROOUT, EXT4_EXT_MARK_UNWRIT2 and EXT4_EXT_DATA_VALID2 flags set, and it calls ext4_split_extent_at() to split the second half with the EXT4_EXT_DATA_VALID2, EXT4_EXT_MARK_UNWRIT1, EXT4_EXT_MAY_ZEROOUT and EXT4_EXT_MARK_UNWRIT2 flags set. However, ext4_split_extent_at() fails to insert the extent due to a temporary lack of space (-ENOSPC). It zeroes out the first half but converts the entire on-disk extent to written since the EXT4_EXT_DATA_VALID2 flag is set, while leaving the second half as unwritten in the extent status tree.

    [0000000000SSSSSS] data               S: stale data, 0: zeroed
    [WWWWWWWWWWWWWWWW] on-disk extent     W: written extent
    [WWWWWWWWWWUUUUUU] extent status tree

Finally, if the DIO fails to write data to the disk, the stale data in the second half will be exposed once the cached extent entry is gone.

Fix this issue by not passing EXT4_GET_BLOCKS_CONVERT when splitting an unwritten extent before submitting I/O, making ext4_split_convert_extents() zero out the entire extent range in this case, and also marking the extent in the extent status tree for consistency.
Fixes: b8a8684502a0 ("ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Message-ID: <20251129103247.686136-4-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
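Why the stale data only becomes visible once the cached entry is gone can be shown with a toy read model: a block that is looked up as unwritten reads back as zeros, while a block looked up as written returns whatever is on disk. The strings below are taken from the diagram in this commit message; the function names are hypothetical, and treating a reclaimed status entry as a fallback to the on-disk extent is a simplification.

```python
def read_block(status: str, disk_byte: str) -> str:
    """Unwritten blocks read back as zeros; written blocks return disk contents."""
    return "0" if status == "U" else disk_byte


def read_file(statuses: str, disk: str) -> str:
    """Read every block using the given per-block extent status."""
    return "".join(read_block(s, d) for s, d in zip(statuses, disk))


disk = "0000000000SSSSSS"     # first part zeroed, second part stale (DIO failed)
on_disk = "WWWWWWWWWWWWWWWW"  # on-disk extent after the failed split
es_tree = "WWWWWWWWWWUUUUUU"  # extent status tree

while_cached = read_file(es_tree, disk)   # lookups answered from the status tree
after_reclaim = read_file(on_disk, disk)  # entry gone, lookup uses the on-disk extent
```

Here `while_cached` is all zeros, whereas `after_reclaim` ends in the stale `S` blocks, which is the exposure the fix prevents by zeroing the whole range.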