summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs
AgeCommit message (Collapse)Author
2026-04-27xfs: zero directory data block padding on write verificationYuto Ohnuki
Old kernels did not zero the pad field in xfs_dir3_data_hdr when initializing directory data blocks, so existing filesystems may have non-zero padding on disk. Zero the pad field in xfs_dir3_data_write_verify alongside the existing LSN and checksum updates. The pad field is pure alignment padding with no runtime meaning, so zeroing it during write verification is safe and has no additional I/O cost. This lets filesystems gradually self-heal stale non-zero padding as directories are modified, without requiring an explicit repair pass. Suggested-by: Dave Chinner <dgc@kernel.org> Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-27xfs: zero entire directory data block header region at initYuto Ohnuki
xfs_dir3_data_init currently zeroes only the xfs_dir3_blk_hdr portion of the directory data block header, then manually initializes the bestfree entries in a loop. This leaves the pad field in xfs_dir3_data_hdr uninitialized and requires explicit zeroing of each bestfree slot. Zero the entire header region (geo->data_entry_offset bytes) unconditionally before setting individual fields. This covers all current and future header fields, all padding (implicit and explicit), and the bestfree array, so the manual zeroing loop for bestfree can be removed. Suggested-by: Dave Chinner <dgc@kernel.org> Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-04-27xfs: remove the meaningless XFS_ALLOC_FLAG_FREEINGJinliang Zheng
In xfs_refcount_finish_one(), there's no need to pass XFS_ALLOC_FLAG_FREEING to xfs_alloc_read_agf(). So remove it. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-30xfs: switch (back) to a per-buftarg buffer hashChristoph Hellwig
The per-AG buffer hashes were added when all buffer lookups took a per-hash look. Since then we've made lookups entirely lockless and removed the need for a hash-wide lock for inserts and removals as well. With this there is no need to sharding the hash, so reduce the used resources by using a per-buftarg hash for all buftargs. Long after writing this initially, syzbot found a problem in the buffer cache teardown order, which this happens to fix as well by doing the entire buffer cache teardown in one places instead of splitting it between destroying the buftarg and the perag structures. Link: https://lore.kernel.org/linux-xfs/aLeUdemAZ5wmtZel@dread.disaster.area/ Reported-by: syzbot+0391d34e801643e2809b@syzkaller.appspotmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: syzbot+0391d34e801643e2809b@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23Merge branch 'xfs-7.0-fixes' into for-nextCarlos Maiolino
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: factor out xfs_attr3_leaf_initLong Li
Factor out wrapper xfs_attr3_leaf_init function, which exported for external use. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-23xfs: factor out xfs_attr3_node_entry_removeLong Li
Factor out wrapper xfs_attr3_node_entry_remove function, which exported for external use. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Long Li <leo.lilong@huawei.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-18Merge branch 'xfs-7.1-merge' into for-nextCarlos Maiolino
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-18xfs: annotate struct xfs_attr_list_context with __counted_by_ptrBill Wendling
Add the `__counted_by_ptr` attribute to the `buffer` field of `struct xfs_attr_list_context`. This field is used to point to a buffer of size `bufsize`. The `buffer` field is assigned in: 1. `xfs_ioc_attr_list` in `fs/xfs/xfs_handle.c` 2. `xfs_xattr_list` in `fs/xfs/xfs_xattr.c` 3. `xfs_getparents` in `fs/xfs/xfs_handle.c` (implicitly initialized to NULL) In `xfs_ioc_attr_list`, `buffer` was assigned before `bufsize`. Reorder them to ensure `bufsize` is set before `buffer` is assigned, although no access happens between them. In `xfs_xattr_list`, `buffer` was assigned before `bufsize`. Reorder them to ensure `bufsize` is set before `buffer` is assigned. In `xfs_getparents`, `buffer` is NULL (from zero initialization) and remains NULL. `bufsize` is set to a non-zero value, but since `buffer` is NULL, no access occurs. In all cases, the pointer `buffer` is not accessed before `bufsize` is set. This patch was generated by CodeMender and reviewed by Bill Wendling. Tested by running xfstests. Signed-off-by: Bill Wendling <morbo@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-06xfs: fix returned valued from xfs_defer_can_appendCarlos Maiolino
xfs_defer_can_append returns a bool, it shouldn't be returning a NULL. Found by code inspection. Fixes: 4dffb2cbb483 ("xfs: allow pausing of pending deferred work items") Cc: <stable@vger.kernel.org> # v6.8 Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Souptick Joarder <souptick.joarder@hpe.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-05xfs: Remove redundant NULL check after __GFP_NOFAILhongao
kzalloc() is called with __GFP_NOFAIL, so a NULL return is not expected. Drop the redundant !map check in xfs_dabuf_map(). Also switch the nirecs-sized allocation to kcalloc(). Signed-off-by: hongao <hongao@uniontech.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-03-04xfs: add write pointer to xfs_rtgroup_geometryWilfred Mallawa
There is currently no XFS ioctl that allows userspace to retrieve the write pointer for a specific realtime group block for zoned XFS. On zoned block devices, userspace can obtain this information via zone reports from the underlying device. However, for zoned XFS operating on regular block devices, no equivalent mechanism exists. Access to the realtime group write pointer is useful to userspace development and analysis tools such as Zonar [1]. So extend the existing struct xfs_rtgroup_geometry to add a new rg_writepointer field. This field is valid if XFS_RTGROUP_GEOM_WRITEPOINTER flag is set. The rg_writepointer field specifies the location of the current writepointer as a block offset into the respective rtgroup. [1] https://lwn.net/Articles/1059364/ Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-02-25xfs: add static size checks for ioctl UABIWilfred Mallawa
The ioctl structures in libxfs/xfs_fs.h are missing static size checks. It is useful to have static size checks for these structures as adding new fields to them could cause issues (e.g. extra padding that may be inserted by the compiler). So add these checks to xfs/xfs_ondisk.h. Due to different padding/alignment requirements across different architectures, to avoid build failures, some structures are ommited from the size checks. For example, structures with "compat_" definitions in xfs/xfs_ioctl32.h are ommited. Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-02-25xfs: remove duplicate static size checksWilfred Mallawa
In libxfs/xfs_ondisk.h, remove some duplicate entries of XFS_CHECK_STRUCT_SIZE(). Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-02-25xfs: Add a comment in xfs_log_sb()Nirjhar Roy (IBM)
Add a comment explaining why the sb_frextents are updated outside the if (xfs_has_lazycount(mp) check even though it is a lazycounter. RT groups are supported only in v5 filesystems which always have lazycounter enabled - so putting it inside the if(xfs_has_lazycount(mp) check is redundant. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-02-25xfs: remove metafile inodes from the active inode statChristoph Hellwig
The active inode (or active vnode until recently) stat can get much larger than expected on file systems with a lot of metafile inodes like zoned file systems on SMR hard disks with 10.000s of rtg rmap inodes. Remove all metafile inodes from the active counter to make it more useful to track actual workloads and add a separate counter for active metafile inodes. This fixes xfs/177 on SMR hard drives. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-02-25xfs: fix code alignment issues in xfs_ondisk.cWilfred Mallawa
Fixup some code alignment issues in xfs_ondisk.c Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-02-25xfs: Refactoring the nagcount and delta calculationNirjhar Roy (IBM)
Introduce xfs_growfs_compute_delta() to calculate the nagcount and delta blocks and refactor the code from xfs_growfs_data_private(). No functional changes. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-01-30xfs: give the defer_relog stat a xs_ prefixChristoph Hellwig
Make this counter naming consistent with all the others. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-30xfs: add zone reset error injectionChristoph Hellwig
Add a new errortag to test that zone reset errors are handled correctly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-30xfs: don't validate error tags in the I/O pathChristoph Hellwig
We can trust XFS developers enough to not pass random stuff to XFS_ERROR_TEST/DELAY. Open code the validity check in xfs_errortag_add, which is the only place that receives unvalidated error tag values from user space, and drop the now pointless xfs_errortag_enabled helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-29xfs: fix spacing style issues in xfs_alloc.cShin Seong-jun
Fix checkpatch.pl errors regarding missing spaces around assignment operators in xfs_alloc_compute_diff() and xfs_alloc_fixup_trees(). Adhere to the Linux kernel coding style by ensuring spaces are placed around the assignment operator '='. Signed-off-by: Shin Seong-jun <shinsj4653@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-28Merge tag 'attr-pptr-speedup-7.0_2026-01-25' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: improve shortform attr performance [2/3] Improve performance of the xattr (and parent pointer) code when the attr structure is in short format and we can therefore perform all updates in a single transaction. Avoiding the attr intent code brings a very nice speedup in those operations. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-28Merge tag 'attr-leaf-freemap-fixes-7.0_2026-01-25' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: fix problems in the attr leaf freemap code [1/3] Running generic/753 for hours revealed data corruption problems in the attr leaf block space management code. Under certain circumstances, freemap entries are left with zero size but a nonzero offset. If that offset happens to be the same offset as the end of the entries array during an attr set operation, the leaf entry table expansion will push the freemap record offset upwards without checking for overlap with any other freemap entries. If there happened to be a second freemap entry overlapping with the newly allocated leaf entry space, then the next attr set operation might find that space and overwrite the leaf entry, thereby corrupting the leaf block. Fix this by zeroing the freemap offset any time we set the size to zero. If a subsequent attr set operation finds no space in the freemap, it will compact the block and regenerate the freemaps. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-28Merge tag 'health-monitoring-7.0_2026-01-20' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: autonomous self healing of filesystems [v7] This patchset builds new functionality to deliver live information about filesystem health events to userspace. This is done by creating an anonymous file that can be read() for events by userspace programs. Events are captured by hooking various parts of XFS and iomap so that metadata health failures, file I/O errors, and major changes in filesystem state (unmounts, shutdowns, etc.) can be observed by programs. When an event occurs, the hook functions queue an event object to each event anonfd for later processing. Programs must have CAP_SYS_ADMIN to open the anonfd and there's a maximum event lag to prevent resource overconsumption. The events themselves can be read() from the anonfd as C structs for the xfs_healer daemon. In userspace, we create a new daemon program that will read the event objects and initiate repairs automatically. This daemon is managed entirely by systemd and will not block unmounting of the filesystem unless repairs are ongoing. They are auto-started by a starter service that uses fanotify. This patchset depends on the new fserror code that Christian Brauner has tentatively accepted for Linux 7.0: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.0.fserror v7: more cleanups of the media verification ioctl, improve comments, and reuse the bio v6: fix pi-breaking bugs, make verify failures trigger health reports and filter bio status flags better v5: add verify-media ioctl, collapse small helper funcs with only one caller v4: drop multiple client support so we can make direct calls into healthmon instead of chasing pointers and doing indirect calls v3: drag out of rfc status With a bit of luck, this should all go splendidly. Conflicts: This merge required an update on files: - fs/xfs/xfs_healthmon.c - fs/xfs/xfs_verify_media.c Such change was required because a parallel developement changed XFS header file xfs.h naming to xfs_platform.h, so the merge required to update those includes in both files above Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-23xfs: add a method to replace shortform attrsDarrick J. Wong
If we're trying to replace an xattr in a shortform attr structure and the old entry fits the new entry, we can just memcpy and exit without having to delete, compact, and re-add the entry (or worse use the attr intent machinery). For parent pointers this only advantages renaming where the filename length stays the same (e.g. mv autoexec.bat scandisk.exe) but for regular xattrs it might be useful for updating security labels and the like. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23xfs: speed up parent pointer operations when possibleDarrick J. Wong
After a recent fsmark benchmarking run, I observed that the overhead of parent pointers on file creation and deletion can be a bit high. On a machine with 20 CPUs, 128G of memory, and an NVME SSD capable of pushing 750000iops, I see the following results: $ mkfs.xfs -f -l logdev=/dev/nvme1n1,size=1g /dev/nvme0n1 -n parent=0 meta-data=/dev/nvme0n1 isize=512 agcount=40, agsize=9767586 blks = sectsz=4096 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=1 = reflink=1 bigtime=1 inobtcount=1 nrext64=1 = exchange=0 metadir=0 data = bsize=4096 blocks=390703440, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0 log =/dev/nvme1n1 bsize=4096 blocks=262144, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 = rgcount=0 rgsize=0 extents = zoned=0 start=0 reserved=0 So we created 40 AGs, one per CPU. Now we create 40 directories and run fsmark: $ time fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d ... # Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025 # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory. # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name) # Files info: size 0 bytes, written with an IO size of 16384 bytes per write # App overhead is time in microseconds spent in the test not doing file writing related system calls. parent=0 parent=1 ================== ================== real 0m57.573s real 1m2.934s user 3m53.578s user 3m53.508s sys 19m44.440s sys 25m14.810s $ time rm -rf ... parent=0 parent=1 ================== ================== real 0m59.649s real 1m12.505s user 0m41.196s user 0m47.489s sys 13m9.566s sys 20m33.844s Parent pointers increase the system time by 28% overhead to create 32 million files that are totally empty. Removing them incurs a system time increase of 56%. Wall time increases by 9% and 22%. For most filesystems, each file tends to have a single owner and not that many xattrs. If the xattr structure is shortform, then all xattr changes are logged with the inode and do not require the the xattr intent mechanism to persist the parent pointer. Therefore, we can speed up parent pointer operations by calling the shortform xattr functions directly if the child's xattr is in short format. Now the overhead looks like: $ time fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d ... parent=0 parent=1 ================== ================== real 0m58.030s real 1m0.983s user 3m54.141s user 3m53.758s sys 19m57.003s sys 21m30.605s $ time rm -rf ... parent=0 parent=1 ================== ================== real 0m58.911s real 1m4.420s user 0m41.329s user 0m45.169s sys 13m27.857s sys 15m58.564s Now parent pointers only increase the system time by 8% for creation and 19% for deletion. Wall time increases by 5% and 9% now. Close the performance gap by creating helpers for the attr set, remove, and replace operations that will try to make direct shortform updates, and fall back to the attr intent machinery if that doesn't work. This works for regular xattrs and for parent pointers. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23xfs: reduce xfs_attr_try_sf_addname parametersDarrick J. Wong
The dp parameter to this function is an alias of args->dp, so remove it for clarity before we go adding new callers. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23xfs: strengthen attr leaf block freemap checkingDarrick J. Wong
Check for erroneous overlapping freemap regions and collisions between freemap regions and the xattr leaf entry array. Note that we must explicitly zero out the extra freemaps in xfs_attr3_leaf_compact so that the in-memory buffer has a correctly initialized freemap array to satisfy the new verification code, even if subsequent code changes the contents before unlocking the buffer. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23xfs: refactor attr3 leaf table size computationDarrick J. Wong
Replace all the open-coded callsites with a single static inline helper. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23xfs: fix freemap adjustments when adding xattrs to leaf blocksDarrick J. Wong
xfs/592 and xfs/794 both trip this assertion in the leaf block freemap adjustment code after ~20 minutes of running on my test VMs: ASSERT(ichdr->firstused >= ichdr->count * sizeof(xfs_attr_leaf_entry_t) + xfs_attr3_leaf_hdr_size(leaf)); Upon enabling quite a lot more debugging code, I narrowed this down to fsstress trying to set a local extended attribute with namelen=3 and valuelen=71. This results in an entry size of 80 bytes. At the start of xfs_attr3_leaf_add_work, the freemap looks like this: i 0 base 448 size 0 rhs 448 count 46 i 1 base 388 size 132 rhs 448 count 46 i 2 base 2120 size 4 rhs 448 count 46 firstused = 520 where "rhs" is the first byte past the end of the leaf entry array. This is inconsistent -- the entries array ends at byte 448, but freemap[1] says there's free space starting at byte 388! By the end of the function, the freemap is in worse shape: i 0 base 456 size 0 rhs 456 count 47 i 1 base 388 size 52 rhs 456 count 47 i 2 base 2120 size 4 rhs 456 count 47 firstused = 440 Important note: 388 is not aligned with the entries array element size of 8 bytes. Based on the incorrect freemap, the name area starts at byte 440, which is below the end of the entries array! That's why the assertion triggers and the filesystem shuts down. How did we end up here? First, recall from the previous patch that the freemap array in an xattr leaf block is not intended to be a comprehensive map of all free space in the leaf block. In other words, it's perfectly legal to have a leaf block with: * 376 bytes in use by the entries array * freemap[0] has [base = 376, size = 8] * freemap[1] has [base = 388, size = 1500] * the space between 376 and 388 is free, but the freemap stopped tracking that some time ago If we add one xattr, the entries array grows to 384 bytes, and freemap[0] becomes [base = 384, size = 0]. So far, so good. But if we add a second xattr, the entries array grows to 392 bytes, and freemap[0] gets pushed up to [base = 392, size = 0]. This is bad, because freemap[1] hasn't been updated, and now the entries array and the free space claim the same space. The fix here is to adjust all freemap entries so that none of them collide with the entries array. Note that this fix relies on commit 2a2b5932db6758 ("xfs: fix attr leaf header freemap.size underflow") and the previous patch that resets zero length freemap entries to have base = 0. Cc: <stable@vger.kernel.org> # v2.6.12 Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-23xfs: delete attr leaf freemap entries when emptyDarrick J. Wong
Back in commit 2a2b5932db6758 ("xfs: fix attr leaf header freemap.size underflow"), Brian Foster observed that it's possible for a small freemap at the end of the end of the xattr entries array to experience a size underflow when subtracting the space consumed by an expansion of the entries array. There are only three freemap entries, which means that it is not a complete index of all free space in the leaf block. This code can leave behind a zero-length freemap entry with a nonzero base. Subsequent setxattr operations can increase the base up to the point that it overlaps with another freemap entry. This isn't in and of itself a problem because the code in _leaf_add that finds free space ignores any freemap entry with zero size. However, there's another bug in the freemap update code in _leaf_add, which is that it fails to update a freemap entry that begins midway through the xattr entry that was just appended to the array. That can result in the freemap containing two entries with the same base but different sizes (0 for the "pushed-up" entry, nonzero for the entry that's actually tracking free space). A subsequent _leaf_add can then allocate xattr namevalue entries on top of the entries array, leading to data loss. But fixing that is for later. For now, eliminate the possibility of confusion by zeroing out the base of any freemap entry that has zero size. Because the freemap is not intended to be a complete index of free space, a subsequent failure to find any free space for a new xattr will trigger block compaction, which regenerates the freemap. It looks like this bug has been in the codebase for quite a long time. Cc: <stable@vger.kernel.org> # v2.6.12 Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-21xfs: split and refactor zone validationChristoph Hellwig
Currently xfs_zone_validate mixes validating the software zone state in the XFS realtime group with validating the hardware state reported in struct blk_zone and deriving the write pointer from that. Move all code that works on the realtime group to xfs_init_zone, and only keep the hardware state validation in xfs_zone_validate. This makes the code more clear, and allows for better reuse in userspace. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21xfs: add a xfs_rtgroup_raw_size helperChristoph Hellwig
Add a helper to figure the on-disk size of a group, accounting for the XFS_SB_FEAT_INCOMPAT_ZONE_GAPS feature if needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21xfs: add missing forward declaration in xfs_zones.hDamien Le Moal
Add the missing forward declaration for struct blk_zone in xfs_zones.h. This avoids headaches with the order of header file inclusion to avoid compilation errors. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21xfs: remove xfs_attr_leaf_hasnameChristoph Hellwig
The calling convention of xfs_attr_leaf_hasname() is problematic, because it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a non-NULL buffer pointer for an already released buffer when xfs_attr3_leaf_lookup_int fails with other error values. Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so that the buffer release code is done by each caller of xfs_attr3_leaf_read. Cc: stable@vger.kernel.org # v5.19+ Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Reported-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21xfs: directly include xfs_platform.hChristoph Hellwig
The xfs.h header conflicts with the public xfs.h in xfsprogs, leading to a spurious difference in all shared libxfs files that have to include libxfs_priv.h in userspace. Directly include xfs_platform.h so that we can add a header of the same name to xfsprogs and remove this major annoyance for the shared code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21xfs: move struct xfs_log_iovec to xfs_log_priv.hChristoph Hellwig
This structure is now only used by the core logging and CIL code. Also remove the unused typedef. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-20xfs: add media verification ioctlDarrick J. Wong
Add a new privileged ioctl so that xfs_scrub can ask the kernel to verify the media of the devices backing an xfs filesystem, and have any resulting media errors reported to fsnotify and xfs_healer. To accomplish this, the kernel allocates a folio between the base page size and 1MB, and issues read IOs to a gradually incrementing range of one of the storage devices underlying an xfs filesystem. If any error occurs, that raw error is reported to the calling process. If the error happens to be one of the ones that the kernel considers indicative of data loss, then it will also be reported to xfs_healthmon and fsnotify. Driving the verification from the kernel enables xfs (and by extension xfs_scrub) to have precise control over the size and error handling of IOs that are issued to the underlying block device, and to emit notifications about problems to other relevant kernel subsystems immediately. Note that the caller is also allowed to reduce the size of the IO and to ask for a relaxation period after each IO. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20xfs: check if an open file is on the health monitored fsDarrick J. Wong
Create a new ioctl for the healthmon file that checks that a given fd points to the same filesystem that the healthmon file is monitoring. This allows xfs_healer to check that when it reopens a mountpoint to perform repairs, the file that it gets matches the filesystem that generated the corruption report. (Note that xfs_healer doesn't maintain an open fd to a filesystem that it's monitoring so that it doesn't pin the mount.) Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20xfs: convey file I/O errors to the health monitorDarrick J. Wong
Connect the fserror reporting to the health monitor so that xfs can send events about file I/O errors to the xfs_healer daemon. These events are entirely informational because xfs cannot regenerate user data, so hopefully the fsnotify I/O error event gets noticed by the relevant management systems. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20xfs: convey externally discovered fsdax media errors to the health monitorDarrick J. Wong
Connect the fsdax media failure notification code to the health monitor so that xfs can send events about that to the xfs_healer daemon. Later on we'll add the ability for the xfs_scrub media scan (phase 6) to report the errors that it finds to the kernel so that those are also logged by xfs_healer. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20xfs: convey filesystem shutdown events to the health monitorDarrick J. Wong
Connect the filesystem shutdown code to the health monitor so that xfs can send events about that to the xfs_healer daemon. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20xfs: convey metadata health events to the health monitorDarrick J. Wong
Connect the filesystem metadata health event collection system to the health monitor so that xfs can send events to xfs_healer as it collects information. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20xfs: convey filesystem unmount events to the health monitorDarrick J. Wong
In xfs_healthmon_unmount, send events to xfs_healer so that it knows that nothing further can be done for the filesystem. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20xfs: create event queuing, formatting, and discovery infrastructureDarrick J. Wong
Create the basic infrastructure that we need to report health events to userspace. We need a compact form for recording critical information about an event and queueing them; a means to notice that we've lost some events; and a means to format the events into something that userspace can handle. Make the kernel export C structures via read(). In a previous iteration of this new subsystem, I wanted to explore data exchange formats that are more flexible and easier for humans to read than C structures. The thought being that when we want to rev (or worse, enlarge) the event format, it ought to be trivially easy to do that in a way that doesn't break old userspace. I looked at formats such as protobufs and capnproto. These look really nice in that extending the wire format is fairly easy, you can give it a data schema and it generates the serialization code for you, handles endianness problems, etc. The huge downside is that neither support C all that well. Too hard, and didn't want to port either of those huge sprawling libraries first to the kernel and then again to xfsprogs. Then I thought, how about JSON? Javascript objects are human readable, the kernel can emit json without much fuss (it's all just strings!) and there are plenty of interpreters for python/rust/c/etc. There's a proposed schema format for json, which means that xfs can publish a description of the events that kernel will emit. Userspace consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document and use it to validate the incoming events from the kernel, which means it can discard events that it doesn't understand, or garbage being emitted due to bugs. However, json has a huge crutch -- javascript is well known for its vague definitions of what are numbers. This makes expressing a large number rather fraught, because the runtime is free to represent a number in nearly any way it wants. Stupider ones will truncate values to word size, others will roll out doubles for uint52_t (yes, fifty-two) with the resulting loss of precision. Not good when you're dealing with discrete units. It just so happens that python's json library is smart enough to see a sequence of digits and put them in a u64 (at least on x86_64/aarch64) but an actual javascript interpreter (pasting into Firefox) isn't necessarily so clever. It turns out that none of the proposed json schemas were ever ratified even in an open-consensus way, so json blobs are still just loosely structured blobs. The parsing in userspace was also noticeably slow and memory-consumptive. Hence only the C interface survives. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20xfs: start creating infrastructure for health monitoringDarrick J. Wong
Start creating helper functions and infrastructure to pass filesystem health events to a health monitoring file. Since this is an administrative interface, we only support a single health monitor process per filesystem, so we don't need to use anything fancy such as notifier chains (== tons of indirect calls). Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-13xfs: set max_agbno to allow sparse alloc of last full inode chunkBrian Foster
Sparse inode cluster allocation sets min/max agbno values to avoid allocating an inode cluster that might map to an invalid inode chunk. For example, we can't have an inode record mapped to agbno 0 or that extends past the end of a runt AG of misaligned size. The initial calculation of max_agbno is unnecessarily conservative, however. This has triggered a corner case allocation failure where a small runt AG (i.e. 2063 blocks) is mostly full save for an extent to the EOFS boundary: [2050,13]. max_agbno is set to 2048 in this case, which happens to be the offset of the last possible valid inode chunk in the AG. In practice, we should be able to allocate the 4-block cluster at agbno 2052 to map to the parent inode record at agbno 2048, but the max_agbno value precludes it. Note that this can result in filesystem shutdown via dirty trans cancel on stable kernels prior to commit 9eb775968b68 ("xfs: walk all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_ags") because the tail AG selection by the allocator sets t_highest_agno on the transaction. If the inode allocator spins around and finds an inode chunk with free inodes in an earlier AG, the subsequent dir name creation path may still fail to allocate due to the AG restriction and cancel. To avoid this problem, update the max_agbno calculation to the agbno prior to the last chunk aligned agbno in the AG. This is not necessarily the last valid allocation target for a sparse chunk, but since inode chunks (i.e. records) are chunk aligned and sparse allocs are cluster sized/aligned, this allows the sb_spino_align alignment restriction to take over and round down the max effective agbno to within the last valid inode chunk in the AG. Note that even though the allocator improvements in the aforementioned commit seem to avoid this particular dirty trans cancel situation, the max_agbno logic improvement still applies as we should be able to allocate from an AG that has been appropriately selected. The more important target for this patch however are older/stable kernels prior to this allocator rework/improvement. Cc: stable@vger.kernel.org # v4.2 Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>