<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/xfs, branch v6.16</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>xfs: don't allocate the xfs_extent_busy structure for zoned RTGs</title>
<updated>2025-07-18T15:42:31+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2025-07-16T12:54:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5948705adbf1a7afcecfe9a13ff39221ef61e16b'/>
<id>5948705adbf1a7afcecfe9a13ff39221ef61e16b</id>
<content type='text'>
Busy extent tracking is primarily used to ensure that freed blocks are
not reused for data allocations before the transaction that deleted them
has been committed to stable storage, and secondarily to drive online
discard.  None of the use cases applies to zoned RTGs, as the zoned
allocator can't overwrite blocks before resetting the zone, which already
flushes out all transactions touching the RTGs.

So the busy extent tracking is not needed for zoned RTGs, and also not
called for zoned RTGs.  But somehow the code to skip allocating and
freeing the structure got lost during the zoned XFS upstreaming process.
This not only causes these structures to unnecessarily allocated, but can
also lead to memory leaks as the xg_busy_extents pointer in the
xfs_group structure is overlayed with the pointer for the linked list
of to be reset zones.

Stop allocating and freeing the structure to not pointlessly allocate
memory which is then leaked when the zone is reset.

Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: &lt;stable@vger.kernel.org&gt; # v6.15
[cem: Fix type and add stable tag]
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Busy extent tracking is primarily used to ensure that freed blocks are
not reused for data allocations before the transaction that deleted them
has been committed to stable storage, and secondarily to drive online
discard.  None of the use cases applies to zoned RTGs, as the zoned
allocator can't overwrite blocks before resetting the zone, which already
flushes out all transactions touching the RTGs.

So the busy extent tracking is not needed for zoned RTGs, and also not
called for zoned RTGs.  But somehow the code to skip allocating and
freeing the structure got lost during the zoned XFS upstreaming process.
This not only causes these structures to unnecessarily allocated, but can
also lead to memory leaks as the xg_busy_extents pointer in the
xfs_group structure is overlayed with the pointer for the linked list
of to be reset zones.

Stop allocating and freeing the structure to not pointlessly allocate
memory which is then leaked when the zone is reset.

Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: &lt;stable@vger.kernel.org&gt; # v6.15
[cem: Fix type and add stable tag]
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: remove the bt_bdev_file buftarg field</title>
<updated>2025-07-08T11:30:26+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2025-05-23T12:31:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9b027aa3e8c44ea826fab1928f5d02a186ff1536'/>
<id>9b027aa3e8c44ea826fab1928f5d02a186ff1536</id>
<content type='text'>
And use bt_file for both bdev and shmem backed buftargs.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
And use bt_file for both bdev and shmem backed buftargs.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: rename the bt_bdev_* buftarg fields</title>
<updated>2025-07-08T11:30:26+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2025-07-07T12:53:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=988a16827582dfb9256d22f74cb363f41f090c90'/>
<id>988a16827582dfb9256d22f74cb363f41f090c90</id>
<content type='text'>
The extra bdev_ is weird, so drop it.  Also improve the comment to make
it clear these are the hardware limits.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The extra bdev_ is weird, so drop it.  Also improve the comment to make
it clear these are the hardware limits.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: refactor xfs_calc_atomic_write_unit_max</title>
<updated>2025-07-08T11:30:26+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2025-07-07T12:53:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e4a7a3f9b24336059c782eaa7ed5ef88a614a1cf'/>
<id>e4a7a3f9b24336059c782eaa7ed5ef88a614a1cf</id>
<content type='text'>
This function and the helpers used by it duplicate the same logic for AGs
and RTGs.  Use the xfs_group_type enum to unify both variants.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This function and the helpers used by it duplicate the same logic for AGs
and RTGs.  Use the xfs_group_type enum to unify both variants.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: add a xfs_group_type_buftarg helper</title>
<updated>2025-07-08T11:30:26+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2025-07-07T12:53:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e74d1fa6a7d738c009a1dc7d739e64000c0d3d33'/>
<id>e74d1fa6a7d738c009a1dc7d739e64000c0d3d33</id>
<content type='text'>
Generalize the xfs_group_type helper in the discard code to return a buftarg
and move it to xfs_mount.h, and use the result in xfs_dax_notify_dev_failure.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Generalize the xfs_group_type helper in the discard code to return a buftarg
and move it to xfs_mount.h, and use the result in xfs_dax_notify_dev_failure.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: remove the call to sync_blockdev in xfs_configure_buftarg</title>
<updated>2025-07-08T11:30:26+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2025-07-07T12:53:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d9b1e348cff7ed13e30886de7a72e1fa0e235863'/>
<id>d9b1e348cff7ed13e30886de7a72e1fa0e235863</id>
<content type='text'>
This extra call is not needed as xfs_alloc_buftarg already calls
sync_blockdev.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This extra call is not needed as xfs_alloc_buftarg already calls
sync_blockdev.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: John Garry &lt;john.g.garry@oracle.com&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: clean up the initial read logic in xfs_readsb</title>
<updated>2025-07-08T11:30:26+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2025-07-07T12:53:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a578a8efa707cc99c22960e86e5b9eaeeda97c5e'/>
<id>a578a8efa707cc99c22960e86e5b9eaeeda97c5e</id>
<content type='text'>
The initial sb read is always for a device logical block size buffer.
The device logical block size is provided in the bt_logical_sectorsize in
struct buftarg, so use that instead of the confusingly named
xfs_getsize_buftarg buffer that reads it from the bdev.

Update the comments surrounding the code to better describe what is going
on.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The initial sb read is always for a device logical block size buffer.
The device logical block size is provided in the bt_logical_sectorsize in
struct buftarg, so use that instead of the confusingly named
xfs_getsize_buftarg buffer that reads it from the bdev.

Update the comments surrounding the code to better describe what is going
on.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: replace strncpy with memcpy in xattr listing</title>
<updated>2025-07-08T09:50:09+00:00</updated>
<author>
<name>Pranav Tyagi</name>
<email>pranav.tyagi03@gmail.com</email>
</author>
<published>2025-06-17T13:14:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f2eb2796b95118b877b63d9fcd3459e70494a498'/>
<id>f2eb2796b95118b877b63d9fcd3459e70494a498</id>
<content type='text'>
Use memcpy() in place of strncpy() in __xfs_xattr_put_listent().
The length is known and a null byte is added manually.

No functional change intended.

Signed-off-by: Pranav Tyagi &lt;pranav.tyagi03@gmail.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Reviewed-by: Carlos Maiolino &lt;cmaiolino@redhat.com&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use memcpy() in place of strncpy() in __xfs_xattr_put_listent().
The length is known and a null byte is added manually.

No functional change intended.

Signed-off-by: Pranav Tyagi &lt;pranav.tyagi03@gmail.com&gt;
Reviewed-by: Darrick J. Wong &lt;djwong@kernel.org&gt;
Reviewed-by: Carlos Maiolino &lt;cmaiolino@redhat.com&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: add FALLOC_FL_ALLOCATE_RANGE to supported flags mask</title>
<updated>2025-06-30T12:16:13+00:00</updated>
<author>
<name>Youling Tang</name>
<email>tangyouling@kylinos.cn</email>
</author>
<published>2025-06-30T01:11:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9e9b46672b1daac814b384286c21fb8332a87392'/>
<id>9e9b46672b1daac814b384286c21fb8332a87392</id>
<content type='text'>
Add FALLOC_FL_ALLOCATE_RANGE to the set of supported fallocate flags in
XFS_FALLOC_FL_SUPPORTED. This change improves code clarity and maintains
by explicitly showing this flag in the supported flags mask.

Note that since FALLOC_FL_ALLOCATE_RANGE is defined as 0x00, this addition
has no functional modifications.

Reviewed-by: Carlos Maiolino &lt;cmaiolino@redhat.com&gt;
Signed-off-by: Youling Tang &lt;tangyouling@kylinos.cn&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add FALLOC_FL_ALLOCATE_RANGE to the set of supported fallocate flags in
XFS_FALLOC_FL_SUPPORTED. This change improves code clarity and maintains
by explicitly showing this flag in the supported flags mask.

Note that since FALLOC_FL_ALLOCATE_RANGE is defined as 0x00, this addition
has no functional modifications.

Reviewed-by: Carlos Maiolino &lt;cmaiolino@redhat.com&gt;
Signed-off-by: Youling Tang &lt;tangyouling@kylinos.cn&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xfs: fix unmount hang with unflushable inodes stuck in the AIL</title>
<updated>2025-06-27T12:14:37+00:00</updated>
<author>
<name>Dave Chinner</name>
<email>dchinner@redhat.com</email>
</author>
<published>2025-06-25T22:49:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7b5f775be14ac1532c049022feadcfe44769566d'/>
<id>7b5f775be14ac1532c049022feadcfe44769566d</id>
<content type='text'>
Unmount of a shutdown filesystem can hang with stale inode cluster
buffers in the AIL like so:

[95964.140623] Call Trace:
[95964.144641]  __schedule+0x699/0xb70
[95964.154003]  schedule+0x64/0xd0
[95964.156851]  xfs_ail_push_all_sync+0x9b/0xf0
[95964.164816]  xfs_unmount_flush_inodes+0x41/0x70
[95964.168698]  xfs_unmountfs+0x7f/0x170
[95964.171846]  xfs_fs_put_super+0x3b/0x90
[95964.175216]  generic_shutdown_super+0x77/0x160
[95964.178060]  kill_block_super+0x1b/0x40
[95964.180553]  xfs_kill_sb+0x12/0x30
[95964.182796]  deactivate_locked_super+0x38/0x100
[95964.185735]  deactivate_super+0x41/0x50
[95964.188245]  cleanup_mnt+0x9f/0x160
[95964.190519]  __cleanup_mnt+0x12/0x20
[95964.192899]  task_work_run+0x89/0xb0
[95964.195221]  resume_user_mode_work+0x4f/0x60
[95964.197931]  syscall_exit_to_user_mode+0x76/0xb0
[95964.201003]  do_syscall_64+0x74/0x130

$ pstree -N mnt |grep umount
	     |-check-parallel---nsexec---run_test.sh---753---umount

It always seems to be generic/753 that triggers this, and repeating
a quick group test run triggers it every 10-15 iterations. Hence it
generally triggers once up every 30-40 minutes of test time. just
running generic/753 by itself or concurrently with a limited group
of tests doesn't reproduce this issue at all.

Tracing on a hung system shows the AIL repeating every 50ms a log
force followed by an attempt to push pinned, aborted inodes from the
AIL (trimmed for brevity):

 xfs_log_force:   lsn 0x1c caller xfsaild+0x18e
 xfs_log_force:   lsn 0x0 caller xlog_cil_flush+0xbd
 xfs_log_force:   lsn 0x1c caller xfs_log_force+0x77
 xfs_ail_pinned:  lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 ....

The inode log items are marked as aborted, which means that either:

a) a transaction commit has occurred, seen an error or shutdown, and
called xfs_trans_free_items() to abort the items. This should happen
before any pinning of log items occurs.

or

b) a dirty transaction has been cancelled. This should also happen
before any pinning of log items occurs.

or

c) AIL insertion at journal IO completion is marked as aborted. In
this case, the log item is pinned by the CIL until journal IO
completes and hence needs to be unpinned. This is then done after
the -&gt;iop_committed() callback is run, so the pin count should be
balanced correctly.

Yet none of these seemed to be occurring. Further tracing indicated
this:

d) Shutdown during CIL pushing resulting in log item completion
being called from checkpoint abort processing. Items are unpinned
and released without serialisation against each other, journal IO
completion or transaction commit completion.

In this case, we may still have a transaction commit in flight that
holds a reference to a xfs_buf_log_item (BLI) after CIL insertion.
e.g. a synchronous transaction will flush the CIL before the
transaction is torn down.  The concurrent CIL push then aborts
insertion it and drops the commit/AIL reference to the BLI. This can
leave the transaction commit context with the last reference to the
BLI which is dropped here:

xfs_trans_free_items()
  -&gt;iop_release
    xfs_buf_item_release
      xfs_buf_item_put
        if (XFS_LI_ABORTED)
	  xfs_trans_ail_delete
	xfs_buf_item_relse()

Unlike the journal completion -&gt;iop_unpin path, this path does not
run stale buffer completion process when it drops the last
reference, hence leaving the stale inodes attached to the buffer
sitting the AIL. There are no other references to those inodes, so
there is no other mechanism to remove them from the AIL. Hence
unmount hangs.

The buffer lock context for stale buffers is passed to the last BLI
reference. This is normally the last BLI unpin on journal IO
completion. The unpin then processes the stale buffer completion and
releases the buffer lock.  However, if the final unpin from journal
IO completion (or CIL push abort) does not hold the last reference
to the BLI, there -must- still be a transaction context that
references the BLI, and so that context must perform the stale
buffer completion processing before the buffer is unlocked and the
BLI torn down.

The fix for this is to rework the xfs_buf_item_relse() path to run
stale buffer completion processing if it drops the last reference to
the BLI. We still hold the buffer locked, so the buffer owner and
lock context is the same as if we passed the BLI and buffer to the
-&gt;iop_unpin() context to finish stale process on journal commit.

However, we have to be careful here. In a shutdown state, we can be
freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse()
and xfs_trans_bdetach().  The existing code handles this case by
considering shutdown state as "aborted", but in doing so
largely masks the failure to clean up stale BLI state from the
xfs_buf_item_relse() path. i.e  regardless of the shutdown state and
whether the item is in the AIL, we must finish the stale buffer
cleanup if we are are dropping the last BLI reference from the
-&gt;iop_relse path in transaction commit context.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Carlos Maiolino &lt;cmaiolino@redhat.com&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Unmount of a shutdown filesystem can hang with stale inode cluster
buffers in the AIL like so:

[95964.140623] Call Trace:
[95964.144641]  __schedule+0x699/0xb70
[95964.154003]  schedule+0x64/0xd0
[95964.156851]  xfs_ail_push_all_sync+0x9b/0xf0
[95964.164816]  xfs_unmount_flush_inodes+0x41/0x70
[95964.168698]  xfs_unmountfs+0x7f/0x170
[95964.171846]  xfs_fs_put_super+0x3b/0x90
[95964.175216]  generic_shutdown_super+0x77/0x160
[95964.178060]  kill_block_super+0x1b/0x40
[95964.180553]  xfs_kill_sb+0x12/0x30
[95964.182796]  deactivate_locked_super+0x38/0x100
[95964.185735]  deactivate_super+0x41/0x50
[95964.188245]  cleanup_mnt+0x9f/0x160
[95964.190519]  __cleanup_mnt+0x12/0x20
[95964.192899]  task_work_run+0x89/0xb0
[95964.195221]  resume_user_mode_work+0x4f/0x60
[95964.197931]  syscall_exit_to_user_mode+0x76/0xb0
[95964.201003]  do_syscall_64+0x74/0x130

$ pstree -N mnt |grep umount
	     |-check-parallel---nsexec---run_test.sh---753---umount

It always seems to be generic/753 that triggers this, and repeating
a quick group test run triggers it every 10-15 iterations. Hence it
generally triggers once up every 30-40 minutes of test time. just
running generic/753 by itself or concurrently with a limited group
of tests doesn't reproduce this issue at all.

Tracing on a hung system shows the AIL repeating every 50ms a log
force followed by an attempt to push pinned, aborted inodes from the
AIL (trimmed for brevity):

 xfs_log_force:   lsn 0x1c caller xfsaild+0x18e
 xfs_log_force:   lsn 0x0 caller xlog_cil_flush+0xbd
 xfs_log_force:   lsn 0x1c caller xfs_log_force+0x77
 xfs_ail_pinned:  lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 ....

The inode log items are marked as aborted, which means that either:

a) a transaction commit has occurred, seen an error or shutdown, and
called xfs_trans_free_items() to abort the items. This should happen
before any pinning of log items occurs.

or

b) a dirty transaction has been cancelled. This should also happen
before any pinning of log items occurs.

or

c) AIL insertion at journal IO completion is marked as aborted. In
this case, the log item is pinned by the CIL until journal IO
completes and hence needs to be unpinned. This is then done after
the -&gt;iop_committed() callback is run, so the pin count should be
balanced correctly.

Yet none of these seemed to be occurring. Further tracing indicated
this:

d) Shutdown during CIL pushing resulting in log item completion
being called from checkpoint abort processing. Items are unpinned
and released without serialisation against each other, journal IO
completion or transaction commit completion.

In this case, we may still have a transaction commit in flight that
holds a reference to a xfs_buf_log_item (BLI) after CIL insertion.
e.g. a synchronous transaction will flush the CIL before the
transaction is torn down.  The concurrent CIL push then aborts
insertion it and drops the commit/AIL reference to the BLI. This can
leave the transaction commit context with the last reference to the
BLI which is dropped here:

xfs_trans_free_items()
  -&gt;iop_release
    xfs_buf_item_release
      xfs_buf_item_put
        if (XFS_LI_ABORTED)
	  xfs_trans_ail_delete
	xfs_buf_item_relse()

Unlike the journal completion -&gt;iop_unpin path, this path does not
run stale buffer completion process when it drops the last
reference, hence leaving the stale inodes attached to the buffer
sitting the AIL. There are no other references to those inodes, so
there is no other mechanism to remove them from the AIL. Hence
unmount hangs.

The buffer lock context for stale buffers is passed to the last BLI
reference. This is normally the last BLI unpin on journal IO
completion. The unpin then processes the stale buffer completion and
releases the buffer lock.  However, if the final unpin from journal
IO completion (or CIL push abort) does not hold the last reference
to the BLI, there -must- still be a transaction context that
references the BLI, and so that context must perform the stale
buffer completion processing before the buffer is unlocked and the
BLI torn down.

The fix for this is to rework the xfs_buf_item_relse() path to run
stale buffer completion processing if it drops the last reference to
the BLI. We still hold the buffer locked, so the buffer owner and
lock context is the same as if we passed the BLI and buffer to the
-&gt;iop_unpin() context to finish stale process on journal commit.

However, we have to be careful here. In a shutdown state, we can be
freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse()
and xfs_trans_bdetach().  The existing code handles this case by
considering shutdown state as "aborted", but in doing so
largely masks the failure to clean up stale BLI state from the
xfs_buf_item_relse() path. i.e  regardless of the shutdown state and
whether the item is in the AIL, we must finish the stale buffer
cleanup if we are are dropping the last BLI reference from the
-&gt;iop_relse path in transaction commit context.

Signed-off-by: Dave Chinner &lt;dchinner@redhat.com&gt;
Reviewed-by: Carlos Maiolino &lt;cmaiolino@redhat.com&gt;
Signed-off-by: Carlos Maiolino &lt;cem@kernel.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
