path: root/fs/netfs
Age | Commit message | Author
2026-05-12 | netfs, afs: Fix write skipping in dir/link writepages | David Howells
Fix netfs_write_single() and afs_single_writepages() to better handle a write that would be skipped due to lock contention and WB_SYNC_NONE by returning 1 from netfs_write_single() if it skipped and making afs_single_writepages() skip also. If a skip occurs, the inode must be re-marked as the VFS may have cleared the mark. This is really only theoretical for directories in netfs_write_single() as the only path to that is through afs_single_writepages() that takes the ->validate_lock around it, thereby serialising it. Fixes: 6dd80936618c ("afs: Use netfslib for directories") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-24-dhowells@redhat.com cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
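The skip-and-re-mark behaviour described above can be illustrated with a small userspace sketch. This is not the kernel code: `toy_inode`, `toy_write_single()` and `toy_single_writepages()` are hypothetical stand-ins, and lock contention is modelled with a plain flag rather than a real lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for WB_SYNC_NONE / WB_SYNC_ALL. */
enum toy_sync_mode { TOY_SYNC_NONE, TOY_SYNC_ALL };

struct toy_inode {
	bool wb_locked;	/* models contention on the writeback lock */
	bool dirty;	/* models the inode's dirty mark */
	int  writes;	/* number of writes actually issued */
};

/* Returns 1 if the write was skipped, 0 if it was performed. */
static int toy_write_single(struct toy_inode *inode, enum toy_sync_mode mode)
{
	assert(inode);
	if (inode->wb_locked) {
		if (mode == TOY_SYNC_NONE)
			return 1;		/* contended and allowed to skip */
		inode->wb_locked = false;	/* toy "wait" for the lock holder */
	}
	inode->writes++;
	return 0;
}

/* Caller side: if the write was skipped, put the dirty mark back. */
static int toy_single_writepages(struct toy_inode *inode, enum toy_sync_mode mode)
{
	int ret;

	inode->dirty = false;	/* models the VFS clearing the mark first */
	ret = toy_write_single(inode, mode);
	if (ret == 1) {
		inode->dirty = true;	/* skipped: re-mark so it is retried */
		return 0;
	}
	return ret;
}
```

The point of the sketch is only the control flow: a distinct "skipped" return value lets the caller tell "nothing to do" apart from "did not get to it".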
2026-05-12 | netfs: Fix netfs_read_folio() to wait on writeback | David Howells
Fix netfs_read_folio() to wait for an ongoing writeback to complete so that it can trust the dirty flag and whatever is attached to folio->private (folio->private may get cleaned up by the collector before it clears the writeback flag). Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-23-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix folio->private handling in netfs_perform_write() | David Howells
Under some circumstances, netfs_perform_write() doesn't correctly manipulate folio->private between NULL, NETFS_FOLIO_COPY_TO_CACHE, pointing to a group and pointing to a netfs_folio struct, leading to potential multiple attachments of private data with associated folio ref leaks and also leaks of netfs_folio structs or netfs_group refs. Fix this by consolidating the place at which a folio is marked uptodate in one place and having that look at what's attached to folio->private and decide how to clean it up and then set the new group. Also, the content shouldn't be flushed if group is NULL, even if a group is specified in the netfs_group parameter, as that would be the case for a new folio. A filesystem should always specify netfs_group or never specify netfs_group. The Sashiko auto-review tool noted that it was theoretically possible that the fpos >= ctx->zero_point section might leak if it modified a streaming write folio. This is unlikely, but with a network filesystem, third party changes can happen. It also pointed out that __netfs_set_group() would leak if called multiple times on the same folio from the "whole folio modify section". Fixes: 8f52de0077ba ("netfs: Reduce number of conditional branches in netfs_perform_write()") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-22-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix partial invalidation of streaming-write folio | David Howells
In netfs_invalidate_folio(), if the region of a partial invalidation overlaps the front (but not all) of a dirty write cached in a streaming write page (dirty, but not uptodate, with the dirty region tracked by a netfs_folio struct), the function modifies the dirty region - but incorrectly as it moves the region forward by setting the start to the start, not the end, of the invalidation region. Fix this by setting finfo->dirty_offset to the end of the invalidation region (iend). Fixes: cce6bfa6ca0e ("netfs: Fix trimming of streaming-write folios in netfs_inval_folio()") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-21-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
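A minimal sketch of the corrected front-trim arithmetic, assuming the dirty region is tracked as an offset and a length within the folio. `toy_finfo` and `toy_trim_front()` are hypothetical names; only the `dirty_offset`/`dirty_len` fields and the `iend` variable mirror what the commit message mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the dirty-region record attached to a streaming-write folio. */
struct toy_finfo {
	size_t dirty_offset;
	size_t dirty_len;
};

/*
 * Handle an invalidation [istart, iend) that covers the front of the dirty
 * region but not all of it.  Returns true if the region was trimmed.
 */
static bool toy_trim_front(struct toy_finfo *finfo, size_t istart, size_t iend)
{
	size_t fstart = finfo->dirty_offset;
	size_t fend = fstart + finfo->dirty_len;

	assert(istart <= iend);
	if (istart > fstart || iend <= fstart || iend >= fend)
		return false;	/* not a front-only overlap */

	finfo->dirty_len = fend - iend;
	finfo->dirty_offset = iend;	/* end of the invalidation, not its start */
	return true;
}
```

With the dirty region at [100, 300) and an invalidation of [50, 150), the surviving region is [150, 300); setting the offset to 50 instead would claim bytes that were never written.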
2026-05-12 | netfs: Fix potential UAF in netfs_unlock_abandoned_read_pages() | David Howells
netfs_unlock_abandoned_read_pages(rreq) accesses the index of the folios it is wanting to unlock and compares that to rreq->no_unlock_folio so that it doesn't unlock a folio being read for netfs_perform_write() or netfs_write_begin(). However, given that netfs_unlock_abandoned_read_pages() is called _after_ NETFS_RREQ_IN_PROGRESS is cleared, the one folio that it's not allowed to dereference is the one specified by ->no_unlock_folio as ownership immediately reverts to the caller. Fix this by storing the folio pointer instead and using that rather than the index. Also fix netfs_unlock_read_folio() where the same applies. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-20-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
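The idea of comparing folio pointers instead of reading the folio's index can be sketched as follows. The types and `toy_unlock_abandoned()` are hypothetical; the only detail taken from the commit message is that the excluded folio is identified by a stored pointer and is never dereferenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
	unsigned long index;
	bool locked;
};

struct toy_rreq {
	/* The folio the caller still owns; compared by address only. */
	const struct toy_folio *no_unlock_folio;
};

/* Unlock every folio except the one the caller retains; returns the count. */
static size_t toy_unlock_abandoned(const struct toy_rreq *rreq,
				   struct toy_folio **folios, size_t nr)
{
	size_t unlocked = 0;

	assert(rreq && folios);
	for (size_t i = 0; i < nr; i++) {
		if (folios[i] == rreq->no_unlock_folio)
			continue;	/* skipped without touching its contents */
		folios[i]->locked = false;
		unlocked++;
	}
	return unlocked;
}
```

Because the check is a pure pointer comparison, it stays safe even if the retained folio has already been handed back to its owner.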
2026-05-12 | netfs: Fix leak of request in netfs_write_begin() error handling | David Howells
Fix netfs_write_begin() to not leak our ref on the request in the event that we get an error from netfs_wait_for_read(). Fixes: 4090b31422a6 ("netfs: Add a function to consolidate beginning a read") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-19-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix early put of sink folio in netfs_read_gaps() | David Howells
Fix netfs_read_gaps() to release the sink page it uses after waiting for the request to complete. The way the sink page is used is that an ITER_BVEC-class iterator is created that has the gaps from the target folio at either end, but has the sink page tiled over the middle so that a single read op can fill in both gaps. The bug was found by KASAN detecting a UAF on the generic/075 xfstest in the cifsd kernel thread that handles reception of data from the TCP socket: BUG: KASAN: use-after-free in _copy_to_iter+0x48a/0xa20 Write of size 885 at addr ffff888107f92000 by task cifsd/1285 CPU: 2 UID: 0 PID: 1285 Comm: cifsd Not tainted 7.0.0 #6 PREEMPT(lazy) Call Trace: dump_stack_lvl+0x5d/0x80 print_report+0x17f/0x4f1 kasan_report+0x100/0x1e0 kasan_check_range+0x10f/0x1e0 __asan_memcpy+0x3c/0x60 _copy_to_iter+0x48a/0xa20 __skb_datagram_iter+0x2c9/0x430 skb_copy_datagram_iter+0x6e/0x160 tcp_recvmsg_locked+0xce0/0x1130 tcp_recvmsg+0xeb/0x300 inet_recvmsg+0xcf/0x3a0 sock_recvmsg+0xea/0x100 cifs_readv_from_socket+0x3a6/0x4d0 [cifs] cifs_read_iter_from_socket+0xdd/0x130 [cifs] cifs_readv_receive+0xaad/0xb10 [cifs] cifs_demultiplex_thread+0x1148/0x1740 [cifs] kthread+0x1cf/0x210 Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Reported-by: Steve French <sfrench@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-18-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix write streaming disablement if fd open O_RDWR | David Howells
In netfs_perform_write(), "write streaming" (the caching of dirty data in dirty but !uptodate folios) is performed to avoid the need to read data that is just going to get immediately overwritten. However, this is/will be disabled in three circumstances: if the fd is open O_RDWR, if fscache is in use (as we need to round out the blocks for DIO) or if content encryption is enabled (again for rounding out purposes). The idea behind disabling it if the fd is open O_RDWR is that we'd need to flush the write-streaming page before we could read the data, particularly through mmap. But netfs now fills in the gaps if ->read_folio() is called on the page, so that is unnecessary. Further, this doesn't actually work if a separate fd is open for reading. Fix this by removing the check for O_RDWR, thereby allowing streaming writes even when we might read. This caused a number of problems with the generic/522 xfstest, but those are now fixed. Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-17-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix read-gaps to remove netfs_folio from filled folio | David Howells
Fix netfs_read_gaps() to remove the netfs_folio record from the folio record before marking the folio uptodate if it successfully fills the gaps around the dirty data in a streaming write folio (dirty, but not uptodate). Found with: fsx -q -N 1000000 -p 10000 -o 128000 -l 600000 \ /xfstest.test/junk --replay-ops=junk.fsxops using the following as junk.fsxops: truncate 0x0 0x138b1 0x8b15d * write 0x507ee 0x10df7 0x927c0 write 0x19993 0x10e04 0x927c0 * mapwrite 0x66214 0x1a253 0x927c0 copy_range 0xb704 0x89b9 0x24429 0x79380 write 0x2402b 0x144a2 0x90660 * mapwrite 0x204d5 0x140a0 0x927c0 * copy_range 0x1f72c 0x137d0 0x7a906 0x927c0 * read 0 0x9157c 0x9157c on cifs with the default cache option. It shows folio 0x24 misbehaving if the FMODE_READ check is commented out in netfs_perform_write(): if (//(file->f_mode & FMODE_READ) || netfs_is_cache_enabled(ctx)) { and no fscache. This was initially found with the generic/522 xfstest. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-16-dhowells@redhat.com Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix potential deadlock in write-through mode | David Howells
Fix netfs_advance_writethrough() to always unlock the supplied folio and to mark it dirty if it isn't yet written to the end. Unfortunately, it can't be marked for writeback until the folio is done with as that may cause a deadlock against mmapped reads and writes. Even though it has been marked dirty, premature writeback can't occur as the caller is holding both inode->i_rwsem (which will prevent concurrent truncation, fallocation, DIO and other writes) and ictx->wb_lock (which will cause flushing to wait and writeback to skip or wait). Note that this may be easier to deal with once the queuing of folios is split from the generation of subrequests. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Closes: https://sashiko.dev/#/patchset/20260427154639.180684-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-15-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix streaming write being overwritten | David Howells
In order to avoid reading whilst writing, netfslib will allow "streaming writes" in which dirty data is stored directly into folios without reading them first. Such folios are marked dirty but may not be marked uptodate. If a folio is entirely written by a streaming write, uptodate will be set, otherwise it will have a netfs_folio struct attached to ->private recording the dirty region. In the event that a partially written streaming write page is to be overwritten entirely by a single write(), netfs_perform_write() will try to copy over it, but doesn't discard the netfs_folio if it succeeds; further, it doesn't correctly handle a partial copy that overwrites some of the dirty data. Fix this by the following: (1) If the folio is successfully overwritten, free the netfs_folio struct before marking the page uptodate. (2) If the copy to the folio partially fails, but short of the dirty data, just ignore the copy. (3) If the copy partially fails and overwrites some of the dirty data, accept the copy, update the netfs_folio struct to record the new data. If the folio is now filled, free the netfs_folio and set uptodate, otherwise return a partial write. Found with: fsx -q -N 1000000 -p 10000 -o 128000 -l 600000 \ /xfstest.test/junk --replay-ops=junk.fsxops using the following as junk.fsxops: truncate 0x0 0 0x927c0 write 0x63fb8 0x53c8 0 copy_range 0xb704 0x19b9 0x24429 0x79380 write 0x2402b 0x144a2 0x90660 * write 0x204d5 0x140a0 0x927c0 * copy_range 0x1f72c 0x137d0 0x7a906 0x927c0 * read 0x00000 0x20000 0x9157c read 0x20000 0x20000 0x9157c read 0x40000 0x20000 0x9157c read 0x60000 0x20000 0x9157c read 0x7e1a0 0xcfb9 0x9157c on cifs with the default cache option. It shows folio 0x24 misbehaving if the FMODE_READ check is commented out in netfs_perform_write(): if (//(file->f_mode & FMODE_READ) || netfs_is_cache_enabled(ctx)) { and no fscache. This was initially found with the generic/522 xfstest. 
Fixes: 8f52de0077ba ("netfs: Reduce number of conditional branches in netfs_perform_write()") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-14-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
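The three cases listed in the message above can be sketched as a small decision function. This is an interpretation, not the kernel implementation: `toy_sfolio` and `toy_overwrite()` are hypothetical, the write is assumed to start at offset 0 of the folio and to be aimed at the whole folio, and "update the netfs_folio struct to record the new data" is assumed to mean merging the copied bytes with the existing dirty region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_outcome { TOY_UPTODATE, TOY_COPY_IGNORED, TOY_PARTIAL_WRITE };

/* Toy streaming-write folio: dirty but not uptodate, with a dirty region. */
struct toy_sfolio {
	size_t size;
	bool   has_finfo;	/* models the attached netfs_folio record */
	bool   uptodate;
	size_t dirty_offset;
	size_t dirty_len;
};

/* 'copied' is how many bytes, counted from offset 0, the copy achieved. */
static enum toy_outcome toy_overwrite(struct toy_sfolio *f, size_t copied)
{
	size_t dirty_end = f->dirty_offset + f->dirty_len;

	assert(f->has_finfo && copied <= f->size);

	if (copied == f->size)
		goto filled;			/* case (1): fully overwritten */
	if (copied <= f->dirty_offset)
		return TOY_COPY_IGNORED;	/* case (2): short of dirty data */

	/* Case (3): the copy reached into the dirty data, so accept it. */
	if (copied > dirty_end)
		dirty_end = copied;
	f->dirty_offset = 0;
	f->dirty_len = dirty_end;
	if (dirty_end == f->size)
		goto filled;
	return TOY_PARTIAL_WRITE;

filled:
	f->has_finfo = false;	/* drop the record before marking uptodate */
	f->uptodate = true;
	return TOY_UPTODATE;
}
```

Whether a copy that stops exactly at the start of the dirty data counts as case (2) or case (3) is a choice made for this sketch; the commit message does not say.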
2026-05-12 | netfs: Defer the emission of trace_netfs_folio() | David Howells
Change netfs_perform_write() to keep the netfs_folio trace value in a variable and emit it later to make it easier to choose the value displayed. This is a prerequisite for a subsequent patch. Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-13-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix netfs_invalidate_folio() to clear dirty bit if all changes gone | David Howells
If a streaming write is made, this will leave the relevant modified folio in a not-uptodate, but dirty state with a netfs_folio struct hung off of folio->private indicating the dirty range. Subsequently truncating the file such that the dirty data in the folio is removed, but the first part of the folio theoretically remains, will cause the netfs_folio struct to be discarded... but will leave the dirty flag set.

If the folio is then read via mmap(), netfs_read_folio() will see that the page is dirty and jump to netfs_read_gaps() to fill in the missing bits. netfs_read_gaps(), however, expects there to be a netfs_folio struct present and can oops because truncate removed it.

Fix this by calling folio_cancel_dirty() in netfs_invalidate_folio() in the event that all the dirty data in the folio is erased (as nfs does).

Also add some tracepoints to log modifications to a dirty page.

This can be reproduced with something like:

	dd if=/dev/zero of=/xfstest.test/foo bs=1M count=1
	umount /xfstest.test
	mount /xfstest.test
	xfs_io -c "w 0xbbbf 0xf96c" \
	       -c "truncate 0xbbbf" \
	       -c "mmap -r 0xb000 0x11000" \
	       -c "mr 0xb000 0x11000" \
	       /xfstest.test/foo

with fscaching disabled (otherwise streaming writes are suppressed) and a change to netfs_perform_write() to remove the check that disallows streaming writes if the fd is open O_RDWR:

	if (//(file->f_mode & FMODE_READ) || <--- comment this out
	    netfs_is_cache_enabled(ctx)) {

It should be reproducible even without this change, but the FMODE_READ check prevents the above trivial xfs_io command from reproducing it.

Note that the initial dd is important: the file must start out sufficiently large that the zero-point logic doesn't just clear the gaps because it knows there's nothing in the file to read yet. Unmounting and mounting is needed to clear the pagecache (there are other ways to do that that may also work).

This was initially reproduced with the generic/522 xfstest on some patches that remove the FMODE_READ restriction.
Fixes: 9ebff83e6481 ("netfs: Prep to use folio->private for write grouping and streaming write") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-12-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
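A toy model of the state rule this fix enforces: when an invalidation erases the whole dirty region, the dirty flag has to go along with the region record. `toy_folio` and `toy_invalidate()` are hypothetical, and clearing `dirty` here merely stands in for the folio_cancel_dirty() call named in the message; partial overlaps are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_folio {
	bool   dirty;
	bool   has_finfo;	/* models the netfs_folio record in ->private */
	size_t dirty_offset;
	size_t dirty_len;
};

/* Invalidate [offset, offset + length) within the folio. */
static void toy_invalidate(struct toy_folio *folio, size_t offset, size_t length)
{
	size_t iend, fstart, fend;

	assert(folio);
	if (!folio->has_finfo)
		return;

	iend = offset + length;
	fstart = folio->dirty_offset;
	fend = fstart + folio->dirty_len;

	if (offset <= fstart && iend >= fend) {
		/* All the dirty data is gone: drop the record AND the flag. */
		folio->has_finfo = false;
		folio->dirty = false;
	}
	/* Partial overlaps would trim the region instead (omitted here). */
}
```

The invariant the sketch keeps is "dirty and not uptodate implies a region record exists", which is what a later gap-filling read relies on.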
2026-05-12 | netfs: Fix overrun check in netfs_extract_user_iter() | David Howells
Fix netfs_extract_user_iter() so that if iov_iter_extract_pages() overfills pages[], then those pages don't get included in the iterator constructed at the end of the function. If there was an overfill, memory corruption has already happened. Fixes: 85dd2c8ff368 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator") Closes: https://sashiko.dev/#/patchset/20260427154639.180684-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-11-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: fix error handling in netfs_extract_user_iter() | Paulo Alcantara
In netfs_extract_user_iter(), if iov_iter_extract_pages() failed to extract user pages, bail out on -ENOMEM, otherwise return the error code only if @npages == 0, allowing short DIO reads and writes to be issued. This fixes mmapstress02 from LTP tests against CIFS. Fixes: 85dd2c8ff368 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator") Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-10-dhowells@redhat.com Cc: netfs@lists.linux.dev Cc: stable@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
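The error policy described above reduces to a small decision, sketched here in isolation. `toy_extract_result()` is a hypothetical helper, not a kernel function; it takes the last error seen (0 or negative) and the number of pages gathered so far.

```c
#include <assert.h>
#include <errno.h>

/*
 * Decide what to return after page extraction stopped with 'err':
 *  - out of memory is always fatal;
 *  - any other error is fatal only if nothing was extracted;
 *  - otherwise report the pages obtained so a short I/O can be issued.
 */
static long toy_extract_result(long err, unsigned int npages)
{
	assert(err <= 0);
	if (err == -ENOMEM)
		return err;
	if (err < 0 && npages == 0)
		return err;
	return (long)npages;
}
```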
2026-05-12 | netfs: Fix potential uninitialised var in netfs_extract_user_iter() | David Howells
In netfs_extract_user_iter(), if it's given a zero-length iterator, it will fall through the loop without setting ret, and so the error handling behaviour will be undefined, depending on whether ret happens to be negative. The value of ret then propagates back up the callstack. Fix this by presetting ret to 0. Fixes: 85dd2c8ff368 ("netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-9-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
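A minimal sketch of the pattern, assuming a loop that only assigns `ret` inside its body. `toy_step()` and `toy_extract_loop()` are hypothetical; the point is that a zero-length input never enters the loop, so `ret` must have a defined value beforehand.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy extraction step: either takes 'want' bytes or fails. */
static long toy_step(size_t want, bool fail)
{
	return fail ? -EFAULT : (long)want;
}

static long toy_extract_loop(size_t count, size_t max_chunk, bool fail)
{
	long ret = 0;	/* preset: the loop body may never run */
	size_t extracted = 0;

	assert(max_chunk > 0);
	while (count > 0) {
		size_t take = count < max_chunk ? count : max_chunk;

		ret = toy_step(take, fail);
		if (ret < 0)
			break;
		extracted += (size_t)ret;
		count -= (size_t)ret;
	}

	if (ret < 0)
		return ret;	/* well-defined even when count was 0 */
	return (long)extracted;
}
```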
2026-05-12 | netfs: fix VM_BUG_ON_FOLIO() issue in netfs_write_begin() call | Viacheslav Dubeyko
Running the generic/013 test case repeatedly reproduces a kernel BUG at mm/filemap.c:1504 with a probability of 30%.

	while true; do
	  sudo ./check generic/013
	done

[ 9849.452376] page: refcount:3 mapcount:0 mapping:00000000e58ff252 index:0x10781 pfn:0x1c322
[ 9849.452412] memcg:ffff8881a1915800
[ 9849.452417] aops:ceph_aops ino:1000058db9e dentry name(?):"f9XXXXXX"
[ 9849.452432] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 9849.452441] raw: 0017ffffc0000000 0000000000000000 dead000000000122 ffff88816110d248
[ 9849.452445] raw: 0000000000010781 0000000000000000 00000003ffffffff ffff8881a1915800
[ 9849.452447] page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
[ 9849.452474] ------------[ cut here ]------------
[ 9849.452476] kernel BUG at mm/filemap.c:1504!
[ 9849.478635] Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
[ 9849.481772] CPU: 2 UID: 0 PID: 84223 Comm: fsstress Not tainted 7.0.0-rc1+ #18 PREEMPT(full)
[ 9849.482881] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
[ 9849.484539] RIP: 0010:folio_unlock+0x85/0xa0
[ 9849.485076] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84
[ 9849.493818] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246
[ 9849.495740] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000
[ 9849.498678] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 9849.500559] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000
[ 9849.501097] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000
[ 9849.502108] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000
[ 9849.502516] FS: 00007e36cbe94740(0000) GS:ffff88824a899000(0000) knlGS:0000000000000000
[ 9849.502996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9849.503810] CR2: 000000c0002b0000 CR3: 000000011bbf6004 CR4: 0000000000772ef0
[ 9849.504459] PKRU: 55555554
[ 9849.504626] Call Trace:
[ 9849.505242] <TASK>
[ 9849.505379] netfs_write_begin+0x7c8/0x10a0
[ 9849.505877] ? __kasan_check_read+0x11/0x20
[ 9849.506384] ? __pfx_netfs_write_begin+0x10/0x10
[ 9849.507178] ceph_write_begin+0x8c/0x1c0
[ 9849.507934] generic_perform_write+0x391/0x8f0
[ 9849.508503] ? __pfx_generic_perform_write+0x10/0x10
[ 9849.509062] ? file_update_time_flags+0x19a/0x4b0
[ 9849.509581] ? ceph_get_caps+0x63/0xf0
[ 9849.510259] ? ceph_get_caps+0x63/0xf0
[ 9849.510530] ceph_write_iter+0xe79/0x1ae0
[ 9849.511282] ? __pfx_ceph_write_iter+0x10/0x10
[ 9849.511839] ? lock_acquire+0x1ad/0x310
[ 9849.512334] ? ksys_write+0xf9/0x230
[ 9849.512582] ? lock_is_held_type+0xaa/0x140
[ 9849.513128] vfs_write+0x512/0x1110
[ 9849.513634] ? __fget_files+0x33/0x350
[ 9849.513893] ? __pfx_vfs_write+0x10/0x10
[ 9849.514143] ? mutex_lock_nested+0x1b/0x30
[ 9849.514394] ksys_write+0xf9/0x230
[ 9849.514621] ? __pfx_ksys_write+0x10/0x10
[ 9849.514887] ? do_syscall_64+0x25e/0x1520
[ 9849.515122] ? __kasan_check_read+0x11/0x20
[ 9849.515366] ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.515655] __x64_sys_write+0x72/0xd0
[ 9849.515885] ? trace_hardirqs_on+0x24/0x1c0
[ 9849.516130] x64_sys_call+0x22f/0x2390
[ 9849.516341] do_syscall_64+0x12b/0x1520
[ 9849.516545] ? do_syscall_64+0x27c/0x1520
[ 9849.516783] ? do_syscall_64+0x27c/0x1520
[ 9849.517003] ? lock_release+0x318/0x480
[ 9849.517220] ? __x64_sys_io_getevents+0x143/0x2d0
[ 9849.517479] ? percpu_ref_put_many.constprop.0+0x8f/0x210
[ 9849.517779] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.518073] ? do_syscall_64+0x25e/0x1520
[ 9849.518291] ? __kasan_check_read+0x11/0x20
[ 9849.518519] ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.518799] ? do_syscall_64+0x27c/0x1520
[ 9849.519024] ? local_clock_noinstr+0xf/0x120
[ 9849.519262] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.519544] ? do_syscall_64+0x25e/0x1520
[ 9849.519781] ? __kasan_check_read+0x11/0x20
[ 9849.520008] ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.520273] ? do_syscall_64+0x27c/0x1520
[ 9849.520491] ? trace_hardirqs_on_prepare+0x178/0x1c0
[ 9849.520767] ? irqentry_exit+0x10c/0x6c0
[ 9849.520984] ? trace_hardirqs_off+0x86/0x1b0
[ 9849.521224] ? exc_page_fault+0xab/0x130
[ 9849.521472] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 9849.521766] RIP: 0033:0x7e36cbd14907
[ 9849.521989] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 9849.523057] RSP: 002b:00007ffff2d2a968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 9849.523484] RAX: ffffffffffffffda RBX: 000000000000e549 RCX: 00007e36cbd14907
[ 9849.523885] RDX: 000000000000e549 RSI: 00005bd797ec6370 RDI: 0000000000000004
[ 9849.524277] RBP: 0000000000000004 R08: 0000000000000047 R09: 00005bd797ec6370
[ 9849.524652] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000049
[ 9849.525062] R13: 0000000010781a37 R14: 00005bd797ec6370 R15: 0000000000000000
[ 9849.525447] </TASK>
[ 9849.525574] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass ghash_clmulni_intel aesni_intel input_leds rapl mac_hid psmouse vga16fb serio_raw vgastate floppy i2c_piix4 bochs qemu_fw_cfg i2c_smbus pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
[ 9849.529150] ---[ end trace 0000000000000000 ]---
[ 9849.529502] RIP: 0010:folio_unlock+0x85/0xa0
[ 9849.530813] Code: 89 df 31 f6 e8 1c f3 ff ff 48 8b 5d f8 c9 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 48 c7 c6 80 6c d9 a7 48 89 df e8 4b b3 10 00 <0f> 0b 48 89 df e8 21 e6 2c 00 eb 9d 0f 1f 40 00 66 66 2e 0f 1f 84
[ 9849.534986] RSP: 0018:ffff8881bb8076b0 EFLAGS: 00010246
[ 9849.536198] RAX: 0000000000000000 RBX: ffffea00070c8980 RCX: 0000000000000000
[ 9849.537718] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 9849.539321] RBP: ffff8881bb8076b8 R08: 0000000000000000 R09: 0000000000000000
[ 9849.540862] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000010782000
[ 9849.542438] R13: ffff8881935de738 R14: ffff88816110d010 R15: 0000000000001000
[ 9849.543996] FS: 00007e36cbe94740(0000) GS:ffff88824b899000(0000) knlGS:0000000000000000
[ 9849.545854] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9849.547092] CR2: 00007e36cb3ff000 CR3: 000000011bbf6006 CR4: 0000000000772ef0
[ 9849.548679] PKRU: 55555554

The race sequence:
1. Read completes -> netfs_read_collection() runs
2. netfs_wake_rreq_flag(rreq, NETFS_RREQ_IN_PROGRESS, ...)
3. netfs_wait_for_read() returns -EFAULT to netfs_write_begin()
4. netfs_unlock_abandoned_read_pages() unlocks the folio
5. netfs_write_begin() calls folio_unlock(folio) -> VM_BUG_ON_FOLIO()

The root cause is that netfs_unlock_abandoned_read_pages() doesn't check the NETFS_RREQ_NO_UNLOCK_FOLIO flag and calls folio_unlock() unconditionally. This patch implements logic in netfs_unlock_abandoned_read_pages() similar to that in netfs_unlock_read_folio().

Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20260512123404.719402-8-dhowells@redhat.com
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: Ceph Development <ceph-devel@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix zeropoint update where i_size > remote_i_size | David Howells
Fix the update of the zero point[*] by netfs_release_folio() when there is uncommitted data in the pagecache beyond the folio being released but the on-server EOF is in this folio (ie. i_size > remote_i_size). The update needs to limit zero_point to remote_i_size, not i_size as i_size is a local phenomenon reflecting updates made locally to the pagecache, not stuff written to the server. remote_i_size tracks the server's i_size. [*] The zero point is the file position from which we can assume that the server will just return zeros, so we can avoid generating reads. Note that netfs_invalidate_folio() probably doesn't need fixing as zero_point should be updated by setattr after truncation or fallocate. Found with: fsx -q -N 1000000 -p 10000 -o 128000 -l 600000 \ /xfstest.test/junk --replay-ops=junk.fsxops using the following as junk.fsxops: truncate 0x0 0x1bbae 0x82864 write 0x3ef2e 0xf9c8 0x1bbae write 0x67e05 0xcb5a 0x4e8f6 mapread 0x57781 0x85b6 0x7495f copy_range 0x5d3d 0x10329 0x54fac 0x7495f write 0x64710 0x1c2b 0x7495f mapread 0x64000 0x1000 0x7495f on cifs with the default cache option. It shows read-gaps on folio 0x64 failing with a short read (ie. it hits EOF) if the FMODE_READ check is commented out in netfs_perform_write(): if (//(file->f_mode & FMODE_READ) || netfs_is_cache_enabled(ctx)) { and no fscache. This was initially found with the generic/522 xfstest. Fixes: cce6bfa6ca0e ("netfs: Fix trimming of streaming-write folios in netfs_inval_folio()") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-7-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
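The clamping rule can be written down as a tiny function. This is a sketch under the assumption that releasing a folio may only raise the zero point, and only as far as the end of that folio; `toy_release_zero_point()` and `toy_pos_t` are hypothetical names.

```c
#include <assert.h>

typedef unsigned long long toy_pos_t;

/*
 * New zero point after releasing the folio at [folio_pos, folio_pos +
 * folio_size).  The limit is the server's EOF (remote_i_size), not the local
 * i_size, which may include data that has not been written back yet.
 */
static toy_pos_t toy_release_zero_point(toy_pos_t zero_point,
					toy_pos_t folio_pos,
					toy_pos_t folio_size,
					toy_pos_t remote_i_size)
{
	toy_pos_t end = folio_pos + folio_size;

	assert(folio_size > 0);
	if (end > remote_i_size)
		end = remote_i_size;	/* the server has nothing beyond this */
	return end > zero_point ? end : zero_point;
}
```

For example, with a 4096-byte folio at position 4096 and a server-side EOF of 5000, the zero point becomes 5000 even if the local i_size is much larger.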
2026-05-12 | netfs: Fix potential for tearing in ->remote_i_size and ->zero_point | David Howells
Fix potential tearing in using ->remote_i_size and ->zero_point by copying i_size_read() and i_size_write() and using the same seqcount as for i_size. We need to make sure that netfslib and the filesystems that use it always hold i_lock whilst updating any of the sizes to prevent i_size_seqcount from getting corrupted. Fixes: 4058f742105e ("netfs: Keep track of the actual remote file size") Fixes: 100ccd18bb41 ("netfs: Optimise away reads above the point at which there can be no data") Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-6-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
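For readers unfamiliar with the technique, here is a generic userspace sketch of a sequence-counter protected pair of values, written with C11 atomics. It is not the kernel's seqcount API: `toy_sizes`, `toy_sizes_write()` and `toy_sizes_read()` are hypothetical, and writers are assumed to be serialised externally (the commit message names i_lock for that role).

```c
#include <assert.h>
#include <stdatomic.h>

struct toy_sizes {
	atomic_uint seq;		/* odd while an update is in progress */
	_Atomic long long remote_i_size;
	_Atomic long long zero_point;
};

/* Caller must hold whatever lock serialises writers. */
static void toy_sizes_write(struct toy_sizes *s, long long remote_i_size,
			    long long zero_point)
{
	unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	assert((seq & 1) == 0);	/* no concurrent writer */
	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->remote_i_size, remote_i_size, memory_order_relaxed);
	atomic_store_explicit(&s->zero_point, zero_point, memory_order_relaxed);
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Lockless reader: retry until a stable, even sequence number is observed. */
static void toy_sizes_read(struct toy_sizes *s, long long *remote_i_size,
			   long long *zero_point)
{
	unsigned int seq;

	do {
		seq = atomic_load_explicit(&s->seq, memory_order_acquire);
		*remote_i_size = atomic_load_explicit(&s->remote_i_size, memory_order_relaxed);
		*zero_point = atomic_load_explicit(&s->zero_point, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while ((seq & 1) ||
		 seq != atomic_load_explicit(&s->seq, memory_order_relaxed));
}
```

If two writers ran unserialised, both could bump the counter from the same starting value and leave it inconsistent, which is why the message insists on holding a lock around every size update.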
2026-05-12 | netfs: Fix netfs_read_to_pagecache() to pause on subreq failure | David Howells
Fix netfs_read_to_pagecache() so that it pauses the generation of new subrequests if an already-issued subrequest fails. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Closes: https://sashiko.dev/#/patchset/20260425125426.3855807-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-5-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12 | netfs: Fix missing barriers when accessing stream->subrequests locklessly | David Howells
The list of subrequests attached to stream->subrequests is accessed without locks by netfs_collect_read_results() and netfs_collect_write_results(), and then they access subreq->flags without taking a barrier after getting the subreq pointer from the list. Relatedly, the functions that build the list don't use any sort of write barrier when constructing the list to make sure that the NETFS_SREQ_IN_PROGRESS flag is perceived to be set first if no lock is taken.

Fix this by:

 (1) Add a new list_add_tail_release() function that uses a release barrier to set the pointer to the new member of the list.

 (2) Add a new list_first_entry_or_null_acquire() function that uses an acquire barrier to read the pointer to the first member in a list (or return NULL).

 (3) Use list_add_tail_release() when adding a subreq to ->subrequests.

 (4) Use list_first_entry_or_null_acquire() when initially accessing the front of the list (when an item is removed, the pointer to the new front item is obtained under the same lock).

Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Link: https://sashiko.dev/#/patchset/20260326104544.509518-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20260512123404.719402-4-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
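The release/acquire pairing described in steps (1) and (2) can be sketched in userspace with C11 atomics. The helpers below are hypothetical analogues, not the kernel functions of similar name: a single producer is assumed, and only the pointer that publishes the new entry is atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_list {
	_Atomic(struct toy_list *) next;	/* read locklessly by the collector */
	struct toy_list *prev;			/* only touched by the producer */
};

#define TOY_SREQ_IN_PROGRESS 0x1UL

struct toy_subreq {
	struct toy_list link;	/* first member, so a link pointer casts back */
	unsigned long flags;
};

static void toy_list_init(struct toy_list *head)
{
	atomic_init(&head->next, head);
	head->prev = head;
}

/* Producer: fully initialise the entry, then publish it with release. */
static void toy_list_add_tail_release(struct toy_list *new, struct toy_list *head)
{
	struct toy_list *last = head->prev;

	atomic_store_explicit(&new->next, head, memory_order_relaxed);
	new->prev = last;
	head->prev = new;
	/* Everything written to the entry before this store becomes visible
	 * to a reader that acquires the pointer. */
	atomic_store_explicit(&last->next, new, memory_order_release);
}

/* Consumer: acquire the first pointer before looking inside the entry. */
static struct toy_list *toy_first_or_null_acquire(struct toy_list *head)
{
	struct toy_list *first =
		atomic_load_explicit(&head->next, memory_order_acquire);

	return first == head ? NULL : first;
}
```

A consumer that obtains a non-NULL entry this way is guaranteed to see the flags that were set before the entry was added, which is the property the commit message is after for NETFS_SREQ_IN_PROGRESS.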
2026-05-12netfs: Fix missing locking around retry adding new subreqsDavid Howells
Fix netfs_retry_read_subrequests() and netfs_retry_write_stream() to take the appropriate lock when adding extra subrequests into stream->subrequests. Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Closes: https://sashiko.dev/#/patchset/20260425125426.3855807-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-3-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-05-12netfs: Fix cancellation of a DIO and single read subrequestsDavid Howells
When the preparation of a new subrequest for a read fails, if the subrequest has already been added to the stream->subrequests list, it can't simply be put and abandoned as the collector may see it. Also, if it hasn't been queued yet, it has two outstanding refs that both need to be put. Both DIO read and single-read dispatch fail at this; further, both differ in the order they do things to the way buffered read works. Fix cancellation of both DIO-read and single-read subrequests that failed preparation by the following steps: (1) Harmonise all three reads (buffered, dio, single) to queue the subreq before prepping it. (2) Make all three call netfs_queue_read() to do the queuing. (3) Set NETFS_RREQ_ALL_QUEUED independently of the queuing as we don't know the length of the subreq at this point. (4) In all cases, set the error and NETFS_SREQ_FAILED flag on the subreq and then call netfs_read_subreq_terminated() to deal with it. This will pass responsibility off to the collector for dealing with it. Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Closes: https://sashiko.dev/#/patchset/20260425125426.3855807-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20260512123404.719402-2-dhowells@redhat.com cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-04-15Merge tag 'mm-stable-2026-04-13-21-45' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly preparatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reporting to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted when KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly updating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) 
Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. 
- "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugepaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove 
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. 
* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...
2026-04-05fs: remove unncessary pagevec.h includesTal Zussman
Remove unused pagevec.h includes from .c files. These were found with the following command: grep -rl '#include.*pagevec\.h' --include='*.c' | while read f; do grep -qE 'PAGEVEC_SIZE|folio_batch' "$f" || echo "$f" done There are probably more removal candidates in .h files, but those are more complex to analyze. Link: https://lkml.kernel.org/r/20260225-pagevec_cleanup-v2-2-716868cc2d11@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-26netfs: Fix the handling of stream->front by removing itDavid Howells
The netfs_io_stream::front member is meant to point to the subrequest currently being collected on a stream, but it isn't actually used this way by direct write (which mostly ignores it). However, there's a tracepoint which looks at it. Further, stream->front is actually redundant with stream->subrequests.next. Fix the potential problem in the direct code by just removing the member and using stream->subrequests.next instead, thereby also simplifying the code. Fixes: a0b4c7a49137 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence") Reported-by: Paulo Alcantara <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/4158599.1774426817@warthog.procyon.org.uk Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-19netfs: Fix read abandonment during retryDavid Howells
Under certain circumstances, all the remaining subrequests from a read request will get abandoned during retry. The abandonment process expects the 'subreq' variable to be set to the place to start abandonment from, but it doesn't always have a useful value (it will be uninitialised on the first pass through the loop and it may point to a deleted subrequest on later passes). Fix the first jump to "abandon:" to set subreq to the start of the first subrequest expected to need retry (which, in this abandonment case, turned out unexpectedly to no longer have NEED_RETRY set). Also clear the subreq pointer after discarding superfluous retryable subrequests to cause an oops if we do try to access it. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/3775287.1773848338@warthog.procyon.org.uk Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-09netfs: Fix NULL pointer dereference in netfs_unbuffered_write() on retryDeepanshu Kartikey
When a write subrequest is marked NETFS_SREQ_NEED_RETRY, the retry path in netfs_unbuffered_write() unconditionally calls stream->prepare_write() without checking if it is NULL. Filesystems such as 9P do not set the prepare_write operation, so stream->prepare_write remains NULL. When get_user_pages() fails with -EFAULT and the subrequest is flagged for retry, this results in a NULL pointer dereference at fs/netfs/direct_write.c:189. Fix this by mirroring the pattern already used in write_retry.c: if stream->prepare_write is NULL, skip renegotiation and directly reissue the subrequest via netfs_reissue_write(), which handles iterator reset, IN_PROGRESS flag, stats update and reissue internally. Fixes: a0b4c7a49137 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence") Reported-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7227db0fbac9f348dba0 Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Link: https://patch.msgid.link/20260307043947.347092-1-kartikey406@gmail.com Tested-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-09netfs: Fix kernel BUG in netfs_limit_iter() for ITER_KVEC iteratorsDeepanshu Kartikey
When a process crashes and the kernel writes a core dump to a 9P filesystem, __kernel_write() creates an ITER_KVEC iterator. This iterator reaches netfs_limit_iter() via netfs_unbuffered_write(), which only handles ITER_FOLIOQ, ITER_BVEC and ITER_XARRAY iterator types, hitting the BUG() for any other type. Fix this by adding netfs_limit_kvec() following the same pattern as netfs_limit_bvec(), since both kvec and bvec are simple segment arrays with pointer and length fields. Dispatch it from netfs_limit_iter() when the iterator type is ITER_KVEC. Fixes: cae932d3aee5 ("netfs: Add func to calculate pagecount/size-limited span of an iterator") Reported-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9c058f0d63475adc97fd Tested-by: syzbot+9c058f0d63475adc97fd@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Link: https://patch.msgid.link/20260307090041.359870-1-kartikey406@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
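The idea behind limiting a span over a simple segment array, as the commit above describes for kvec and bvec, can be sketched in userspace C. This is not the kernel's netfs_limit_kvec(); limit_span() is a hypothetical name, and struct iovec from <sys/uio.h> is used only because it has the same pointer-plus-length shape as a kvec. The parameter meanings (start offset, byte cap, segment cap) are assumptions modelled on the commit text.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Return how many bytes, starting start_offset bytes into the segment
 * array, can be covered without exceeding max_size bytes or max_segs
 * segments.  Hypothetical helper for illustration only.
 */
static size_t limit_span(const struct iovec *vec, size_t nr_segs,
			 size_t start_offset, size_t max_size, size_t max_segs)
{
	size_t span = 0, segs = 0, i = 0;

	assert(vec != NULL || nr_segs == 0);

	/* Skip whole segments that lie before the starting offset. */
	while (i < nr_segs && start_offset >= vec[i].iov_len) {
		start_offset -= vec[i].iov_len;
		i++;
	}

	for (; i < nr_segs && span < max_size && segs < max_segs; i++) {
		size_t len = vec[i].iov_len - start_offset;

		start_offset = 0;	/* only the first segment is entered part-way */
		if (len == 0)
			continue;	/* empty segments consume no budget */
		if (len > max_size - span)
			len = max_size - span;	/* clamp to the byte cap */
		span += len;
		segs++;
	}
	return span;
}
```

The same loop works for any array of (pointer, length) pairs, which is why a kvec variant can follow the bvec pattern closely.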
2026-02-26netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequenceDavid Howells
Fix netfslib such that when it's making an unbuffered or DIO write, to make sure that it sends each subrequest strictly sequentially, waiting till the previous one is 'committed' before sending the next so that we don't have pieces landing out of order and potentially leaving a hole if an error occurs (ENOSPC for example). This is done by copying in just those bits of issuing, collecting and retrying subrequests that are necessary to do one subrequest at a time. Retrying, in particular, is simpler because if the current subrequest needs retrying, the source iterator can just be copied again and the subrequest prepped and issued again without needing to be concerned about whether it needs merging with the previous or next in the sequence. Note that the issuing loop waits for a subrequest to complete right after issuing it, but this wait could be moved elsewhere allowing preparatory steps to be performed whilst the subrequest is in progress. In particular, once content encryption is available in netfslib, that could be done whilst waiting, as could cleanup of buffers that have been completed. Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/58526.1772112753@warthog.procyon.org.uk Tested-by: Steve French <sfrench@samba.org> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-21Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
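The three rewrite shapes listed above (single object, array, flexible array) can be mimicked in userspace to show why the results are typed pointers rather than "void *". The demo_alloc_* macros below are hypothetical analogues built on malloc()/calloc() and the GNU __typeof__ extension; they are not the kernel's kmalloc_obj() family, take no GFP argument, and do not reproduce the kernel's overflow checking for the flexible-array size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* T may be a type name or an expression such as *ptr. */
#define demo_alloc_obj(T) \
	((__typeof__(T) *)malloc(sizeof(T)))
/* calloc() is used so the count * size multiplication is checked. */
#define demo_alloc_objs(T, n) \
	((__typeof__(T) *)calloc((n), sizeof(T)))
/* Header plus n trailing elements of the flexible array member fam. */
#define demo_alloc_flex(T, fam, n) \
	((__typeof__(T) *)malloc(sizeof(T) + \
		(n) * sizeof(((__typeof__(T) *)0)->fam[0])))

struct item {
	int id;
};

struct packet {
	size_t len;
	unsigned char data[];	/* flexible array member */
};

/* Allocate a packet with room for len payload bytes. */
static struct packet *packet_new(size_t len)
{
	struct packet *p = demo_alloc_flex(*p, data, len);

	if (p)
		p->len = len;
	return p;
}
```

Because each macro casts to a pointer to the named type, assigning the result to a mismatched pointer type draws a compiler diagnostic, which a plain "void *" return would not.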
2026-02-08netfs: avoid double increment of retry_count in subreqShyam Prasad N
This change fixes the instance of double incrementing of retry_count. The increment of this count already happens when netfs_reissue_write gets called. Incrementing this value before is not necessary. Fixes: 4acb665cf4f3 ("netfs: Work around recursion by abandoning retry if nothing read") Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2026-02-08netfs: when subreq is marked for retry, do not check if it faced an errorShyam Prasad N
The *_subreq_terminated functions today only process the NEED_RETRY flag when the subreq was successful or failed with EAGAIN error. However, there could be other retriable errors for network filesystems. Avoid this by processing the NEED_RETRY irrespective of the error code faced by the subreq. If it was specifically marked for retry, the error code must not matter. Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-12-24netfs: Fix early read unlock of page with EOF in middleDavid Howells
The read result collection for buffered reads seems to run ahead of the completion of subrequests under some circumstances, as can be seen in the following log snippet: 9p_client_res: client 18446612686390831168 response P9_TREAD tag 0 err 0 ... netfs_sreq: R=00001b55[1] DOWN TERM f=192 s=0 5fb2/5fb2 s=5 e=0 ... netfs_collect_folio: R=00001b55 ix=00004 r=4000-5000 t=4000/5fb2 netfs_folio: i=157f3 ix=00004-00004 read-done netfs_folio: i=157f3 ix=00004-00004 read-unlock netfs_collect_folio: R=00001b55 ix=00005 r=5000-5fb2 t=5000/5fb2 netfs_folio: i=157f3 ix=00005-00005 read-done netfs_folio: i=157f3 ix=00005-00005 read-unlock ... netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=c netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=8 ... netfs_sreq: R=00001b55[2] ZERO SUBMT f=000 s=5fb2 0/4e s=0 e=0 netfs_sreq: R=00001b55[2] ZERO TERM f=102 s=5fb2 4e/4e s=5 e=0 The 'cto=5fb2' indicates the collected file pos we've collected results to so far - but we still have 0x4e more bytes to go - so we shouldn't have collected folio ix=00005 yet. The 'ZERO' subreq that clears the tail happens after we unlock the folio, allowing the application to see the uncleared tail through mmap. The problem is that netfs_read_unlock_folios() will unlock a folio in which the amount of read results collected hits EOF position - but the ZERO subreq lies beyond that and so happens after. Fix this by changing the end check to always be the end of the folio and never the end of the file. In the future, I should look at clearing to the end of the folio here rather than adding a ZERO subreq to do this. On the other hand, the ZERO subreq can run in parallel with an async READ subreq. Further, the ZERO subreq may still be necessary to, say, handle extents in a ceph file that don't have any backing store and are thus implicitly all zeros. 
This can be reproduced by creating a file, the size of which doesn't align to a page boundary, e.g. 24998 (0x5fb2) bytes and then doing something like: xfs_io -c "mmap -r 0 0x6000" -c "madvise -d 0 0x6000" \ -c "mread -v 0 0x6000" /xfstest.test/x The last 0x4e bytes should all be 00, but if the tail hasn't been cleared yet, you may see rubbish there. This can be reproduced with kafs by modifying the kernel to disable the call to netfs_read_subreq_progress() and to stop afs_issue_read() from doing the async call for NETFS_READAHEAD. Reproduction can be made easier by inserting an mdelay(100) in netfs_issue_read() for the ZERO-subreq case. AFS and CIFS are normally unlikely to show this as they dispatch READ ops asynchronously, which allows the ZERO-subreq to finish first. 9P's READ op is completely synchronous, so the ZERO-subreq will always happen after. It isn't seen all the time, though, because the collection may be done in a worker thread. Reported-by: Christian Schoenebeck <linux_oss@crudebyte.com> Link: https://lore.kernel.org/r/8622834.T7Z3S40VBb@weasel/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/938162.1766233900@warthog.procyon.org.uk Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com> Acked-by: Dominique Martinet <asmadeus@codewreck.org> Suggested-by: Dominique Martinet <asmadeus@codewreck.org> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: v9fs@lists.linux.dev cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-01Merge tag 'vfs-6.19-rc1.folio' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull folio updates from Christian Brauner: "Add a new folio_next_pos() helper function that returns the file position of the first byte after the current folio. This is a common operation in filesystems when needing to know the end of the current folio. The helper is lifted from btrfs which already had its own version, and is now used across multiple filesystems and subsystems: - btrfs - buffer - ext4 - f2fs - gfs2 - iomap - netfs - xfs - mm This fixes a long-standing bug in ocfs2 on 32-bit systems with files larger than 2GiB. Presumably this is not a common configuration, but the fix is backported anyway. The other filesystems did not have bugs, they were just mildly inefficient. This also introduces uoff_t as the unsigned version of loff_t. A recent commit inadvertently changed a comparison from being unsigned (on 64-bit systems) to being signed (which it had always been on 32-bit systems), leading to sporadic fstests failures. Generally file sizes are restricted to being a signed integer, but in places where -1 is passed to indicate "up to the end of the file", it is convenient to have an unsigned type to ensure comparisons are always unsigned regardless of architecture" * tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Add uoff_t mm: Use folio_next_pos() xfs: Use folio_next_pos() netfs: Use folio_next_pos() iomap: Use folio_next_pos() gfs2: Use folio_next_pos() f2fs: Use folio_next_pos() ext4: Use folio_next_pos() buffer: Use folio_next_pos() btrfs: Use folio_next_pos() filemap: Add folio_next_pos()
2025-10-31netfs: Use folio_next_pos()Matthew Wilcox (Oracle)
This is one instruction more efficient than open-coding folio_pos() + folio_size(). It's the equivalent of (x + y) << z rather than (x << z) + (y << z). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20251024170822.1427218-9-willy@infradead.org Acked-by: David Howells <dhowells@redhat.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: David Howells <dhowells@redhat.com> Cc: Paulo Alcantara <pc@manguebit.org> Cc: netfs@lists.linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
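The arithmetic identity behind this change can be checked with a small userspace model. The demo_folio struct, DEMO_PAGE_SHIFT and the demo_* functions are hypothetical; they assume a folio is described by a page index and a page count, with x the index, y the page count and z the page shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumed 4 KiB pages */

struct demo_folio {
	uint64_t index;		/* index of the first page in the file */
	uint64_t nr_pages;	/* number of pages in the folio */
};

/* File position of the first byte of the folio: x << z. */
static uint64_t demo_folio_pos(const struct demo_folio *f)
{
	assert(f != NULL);
	return f->index << DEMO_PAGE_SHIFT;
}

/* Size of the folio in bytes: y << z. */
static uint64_t demo_folio_size(const struct demo_folio *f)
{
	assert(f != NULL);
	return f->nr_pages << DEMO_PAGE_SHIFT;
}

/* Position of the first byte after the folio: one add and one shift. */
static uint64_t demo_folio_next_pos(const struct demo_folio *f)
{
	assert(f != NULL);
	return (f->index + f->nr_pages) << DEMO_PAGE_SHIFT;
}
```

For a two-page folio at index 4, both forms give 0x6000; the combined form simply shifts once after adding instead of shifting twice before adding.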
2025-10-20Coccinelle-based conversion to use ->i_state accessorsMateusz Guzik
All places were patched by coccinelle with the default expecting that ->i_lock is held, afterwards entries got fixed up by hand to use unlocked variants as needed. The script: @@ expression inode, flags; @@ - inode->i_state & flags + inode_state_read(inode) & flags @@ expression inode, flags; @@ - inode->i_state &= ~flags + inode_state_clear(inode, flags) @@ expression inode, flag1, flag2; @@ - inode->i_state &= ~flag1 & ~flag2 + inode_state_clear(inode, flag1 | flag2) @@ expression inode, flags; @@ - inode->i_state |= flags + inode_state_set(inode, flags) @@ expression inode, flags; @@ - inode->i_state = flags + inode_state_assign(inode, flags) @@ expression inode, flags; @@ - flags = inode->i_state + flags = inode_state_read(inode) @@ expression inode, flags; @@ - READ_ONCE(inode->i_state) & flags + inode_state_read(inode) & flags Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-29Merge tag 'vfs-6.18-rc1.workqueue' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-26netfs: fix reference leakMax Kellermann
Commit 20d72b00ca81 ("netfs: Fix the request's work item to not require a ref") modified netfs_alloc_request() to initialize the reference counter to 2 instead of 1. The rationale was that the request's "work" would release the second reference after completion (via netfs_{read,write}_collection_worker()). That works most of the time if all goes well. However, it leaks this additional reference if the request is released before the I/O operation has been submitted: the error code path only decrements the reference counter once and the work item will never be queued because there will never be a completion. This has caused outages of our whole server cluster today because tasks were blocked in netfs_wait_for_outstanding_io(), leading to deadlocks in Ceph (another bug that I will address soon in another patch). This was caused by a netfs_pgpriv2_begin_copy_to_cache() call which failed in fscache_begin_write_operation(). The leaked netfs_io_request was never completed, leaving `netfs_inode.io_count` with a positive value forever. All of this is super-fragile code. Finding out which code paths will lead to an eventual completion and which do not is hard to see: - Some functions like netfs_create_write_req() allocate a request, but will never submit any I/O. - netfs_unbuffered_read_iter_locked() calls netfs_unbuffered_read() and then netfs_put_request(); however, netfs_unbuffered_read() can also fail early before submitting the I/O request, therefore another netfs_put_request() call must be added there. A rule of thumb is that functions that return a `netfs_io_request` do not submit I/O, and all of their callers must be checked. For my taste, the whole netfs code needs an overhaul to make reference counting easier to understand and less fragile & obscure. But to fix this bug here and now and produce a patch that is adequate for a stable backport, I tried a minimal approach that quickly frees the request object upon early failure. 
I decided against adding a second netfs_put_request() each time because that would cause code duplication which obscures the code further. Instead, I added the function netfs_put_failed_request() which frees such a failed request synchronously under the assumption that the reference count is exactly 2 (as initially set by netfs_alloc_request() and never touched), verified by a WARN_ON_ONCE(). It then deinitializes the request object (without going through the "cleanup_work" indirection) and frees the allocation (with RCU protection to protect against concurrent access by netfs_requests_seq_start()). All code paths that fail early have been changed to call netfs_put_failed_request() instead of netfs_put_request(). Additionally, I have added a netfs_put_request() call to netfs_unbuffered_read() as explained above because the netfs_put_failed_request() approach does not work there. Fixes: 20d72b00ca81 ("netfs: Fix the request's work item to not require a ref") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
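The reference-counting pattern described above can be modelled in userspace as follows. This is a simplified sketch, not the netfs code: the demo_* names are hypothetical, the warning is a plain fprintf() standing in for WARN_ON_ONCE(), and the RCU-deferred free and "cleanup_work" indirection mentioned in the commit are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct demo_request {
	atomic_int ref;
};

static int demo_freed;	/* counts frees so the behaviour can be observed */

/* Allocate with two refs: one for the caller, one for the work item. */
static struct demo_request *demo_alloc_request(void)
{
	struct demo_request *rreq = malloc(sizeof(*rreq));

	if (rreq)
		atomic_init(&rreq->ref, 2);
	return rreq;
}

static void demo_free_request(struct demo_request *rreq)
{
	free(rreq);
	demo_freed++;
}

/* Normal put: the object is freed only when the last ref is dropped. */
static void demo_put_request(struct demo_request *rreq)
{
	assert(rreq != NULL);
	if (atomic_fetch_sub(&rreq->ref, 1) == 1)
		demo_free_request(rreq);
}

/*
 * Early-failure put: no I/O was submitted, so the work item will never
 * run to drop its ref.  Expect the untouched initial count of 2 and
 * free the object directly.
 */
static void demo_put_failed_request(struct demo_request *rreq)
{
	int ref;

	assert(rreq != NULL);
	ref = atomic_load(&rreq->ref);
	if (ref != 2)
		fprintf(stderr, "unexpected refcount %d\n", ref);
	demo_free_request(rreq);
}
```

In this model a single demo_put_request() on a freshly allocated request leaves the count at 1 and the object allocated, which corresponds to the leak the commit describes; demo_put_failed_request() releases it in one step.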
2025-09-19fs: replace use of system_unbound_wq with system_dfl_wqMarco Crivellari
Currently, if a user enqueues a work item using schedule_delayed_work(), the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Add system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-15netfs: Prevent duplicate unlockingLizhi Xu
The folio lock has been released here, so there is no need to jump to error_folio_unlock to release it again. Reported-by: syzbot+b73c7d94a151e2ee1e9b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b73c7d94a151e2ee1e9b Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Acked-by: David Howells <dhowells@redhat.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-15netfs: Fix unbuffered write error handlingDavid Howells
If all the subrequests in an unbuffered write stream fail, the subrequest collector doesn't update the stream->transferred value and it retains its initial LONG_MAX value. Unfortunately, if all active streams fail, then we take the smallest value of { LONG_MAX, LONG_MAX, ... } as the value to set in wreq->transferred - which is then returned from ->write_iter(). LONG_MAX was chosen as the initial value so that all the streams can be quickly assessed by taking the smallest value of all stream->transferred - but this only works if we've set any of them. Fix this by adding a flag to indicate whether the value in stream->transferred is valid and checking that when we integrate the values. stream->transferred can then be initialised to zero. This was found by running the generic/750 xfstest against cifs with cache=none. It splices data to the target file. Once (if) it has used up all the available scratch space, the writes start failing with ENOSPC. This causes ->write_iter() to fail. However, it was returning wreq->transferred, i.e. LONG_MAX, rather than an error (because it thought the amount transferred was non-zero) and iter_file_splice_write() would then try to clean up that amount of pipe bufferage - leading to an oops when it overran. The kernel log showed: CIFS: VFS: Send error in write = -28 followed by: BUG: kernel NULL pointer dereference, address: 0000000000000008 with: RIP: 0010:iter_file_splice_write+0x3a4/0x520 do_splice+0x197/0x4e0 or: RIP: 0010:pipe_buf_release (include/linux/pipe_fs_i.h:282) iter_file_splice_write (fs/splice.c:755) Also put a warning check into splice to announce if ->write_iter() returned that it had written more than it was asked to. 
Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Reported-by: Xiaoli Feng <fengxiaoli0714@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220445 Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/915443.1755207950@warthog.procyon.org.uk cc: Paulo Alcantara <pc@manguebit.org> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <sprasad@microsoft.com> cc: netfs@lists.linux.dev cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
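The validity-flag approach from the unbuffered write fix above can be sketched as follows. This is a simplified userspace model, not the netfslib code: `struct model_stream`, its fields, and `model_min_transferred()` are hypothetical names chosen for illustration. It assumes the integrated value is the smallest `transferred` over the active streams that have recorded one, and that it is zero when none have.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a write stream; not the kernel's netfs_io_stream. */
struct model_stream {
	bool active;             /* stream is in use for this request */
	bool transferred_valid;  /* collector has recorded a value */
	long long transferred;   /* initialised to 0, not LONG_MAX */
};

/*
 * Integrate the per-stream values: take the smallest valid value across
 * the active streams. If every stream failed and no value was recorded,
 * report 0 instead of leaking a sentinel such as LONG_MAX to the caller.
 */
static long long model_min_transferred(const struct model_stream *s, size_t n)
{
	long long min = 0;
	bool have = false;

	for (size_t i = 0; i < n; i++) {
		if (!s[i].active || !s[i].transferred_valid)
			continue;
		if (!have || s[i].transferred < min) {
			min = s[i].transferred;
			have = true;
		}
	}
	return have ? min : 0;
}
```

With this shape, a request whose streams all failed yields a transferred count of zero, so a caller can fall through to reporting the error rather than an impossible byte count.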
2025-07-14netfs: Fix race between cache write completion and ALL_QUEUED being setDavid Howells
When netfslib is issuing subrequests, the subrequests start processing immediately and may complete before we reach the end of the issuing function. At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED to indicate to the collector that we aren't going to issue any more subreqs and that it can do the final notifications and cleanup. Now, this isn't a problem if the request is synchronous (NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be done in-thread and we're guaranteed an opportunity to run the collector. However, if the request is asynchronous, collection is primarily triggered by the termination of subrequests queuing it on a workqueue. Now, a race can occur here if the app thread sets ALL_QUEUED after the last subrequest terminates. This can happen most easily with the copy2cache code (as used by Ceph) where, in the collection routine of a read request, an asynchronous write request is spawned to copy data to the cache. Folios are added to the write request as they're unlocked, but there may be a delay before ALL_QUEUED is set as the write subrequests may complete before we get there. If all the write subreqs have finished by the ALL_QUEUED point, no further events happen and the collection never happens, leaving the request hanging. Fix this by queuing the collector after setting ALL_QUEUED. This is a bit heavy-handed and it may be sufficient to do it only if there are no extant subreqs. Also add a tracepoint to cross-reference both requests in a copy-to-request operation and add a trace to the netfs_rreq tracepoint to indicate the setting of ALL_QUEUED. 
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Reported-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250711151005.2956810-3-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <slava@dubeyko.com> cc: Alex Markuze <amarkuze@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
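The ALL_QUEUED race described above can be replayed deterministically in a single-threaded model. This is an illustrative sketch only: `struct model_wreq` and the `model_*` functions are hypothetical, the collector is called directly rather than queued on a workqueue, and the `kick` parameter exists purely to contrast the behaviour before and after the fix.

```c
#include <stdbool.h>

/* Hypothetical model of an asynchronous write request. */
struct model_wreq {
	int outstanding;   /* subrequests issued but not yet terminated */
	bool all_queued;   /* stands in for NETFS_RREQ_ALL_QUEUED */
	bool completed;    /* final notification and cleanup were done */
};

static void model_wreq_init(struct model_wreq *w)
{
	w->outstanding = 0;
	w->all_queued = false;
	w->completed = false;
}

/* The collector can only finish once nothing more will be issued. */
static void model_collect(struct model_wreq *w)
{
	if (w->all_queued && w->outstanding == 0)
		w->completed = true;
}

static void model_issue(struct model_wreq *w)
{
	w->outstanding++;
}

/* Subrequest termination is what normally triggers collection. */
static void model_subreq_terminated(struct model_wreq *w)
{
	w->outstanding--;
	model_collect(w);
}

/*
 * Set ALL_QUEUED. With kick == false (the old behaviour in this model),
 * a request whose subrequests have all terminated already never gets
 * collected. With kick == true, the collector runs once more.
 */
static void model_set_all_queued(struct model_wreq *w, bool kick)
{
	w->all_queued = true;
	if (kick)
		model_collect(w);
}
```

If both subrequests terminate before `model_set_all_queued()` is called, only the `kick == true` variant ends with `completed` set, which matches the hang and the fix described in the commit message.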
2025-07-14netfs: Fix copy-to-cache so that it performs collection with ceph+fscacheDavid Howells
The netfs copy-to-cache that is used by Ceph with local caching sets up a new request to write data just read to the cache. The request is started and then left to look after itself whilst the app continues. The request gets notified by the backing fs upon completion of the async DIO write, but then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't set - but the app isn't waiting there, and so the request just hangs. Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the notification from the backing filesystem to put the collection onto a work queue instead. Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Reported-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250711151005.2956810-2-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <slava@dubeyko.com> cc: Alex Markuze <amarkuze@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01netfs: Update tracepoints in a number of waysDavid Howells
Make a number of updates to the netfs tracepoints: (1) Remove a duplicate trace from netfs_unbuffered_write_iter_locked(). (2) Move the trace in netfs_wake_rreq_flag() to after the flag is cleared so that the change appears in the trace. (3) Differentiate the use of netfs_rreq_trace_wait/woke_queue symbols. (4) Don't do so many trace emissions in the wait functions as some of them are redundant. (5) In netfs_collect_read_results(), differentiate a subreq that's being abandoned vs one that has been consumed in a regular way. (6) Add a tracepoint to indicate the call to ->ki_complete(). (7) Don't double-increment the subreq_counter when retrying a write. (8) Move the netfs_sreq_trace_io_progress tracepoint within cifs code to just MID_RESPONSE_RECEIVED and add different tracepoints for other MID states and note check failure. Signed-off-by: David Howells <dhowells@redhat.com> Co-developed-by: Paulo Alcantara <pc@manguebit.org> Signed-off-by: Paulo Alcantara <pc@manguebit.org> Link: https://lore.kernel.org/20250701163852.2171681-14-dhowells@redhat.com cc: Steve French <sfrench@samba.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01netfs: Renumber the NETFS_RREQ_* flags to make traces easier to readDavid Howells
Renumber the NETFS_RREQ_* flags to put the most useful status bits in the bottom nibble - and therefore the last hex digit in the trace output - making it easier to grasp the state at a glance. In particular, put the IN_PROGRESS flag in bit 0 and ALL_QUEUED at bit 1. Also make the flags field in /proc/fs/netfs/requests larger to accommodate all the flags. Also make the flags field in the netfs_sreq tracepoint larger to accommodate all the NETFS_SREQ_* flags. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-13-dhowells@redhat.com Reviewed-by: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01netfs: Merge i_size update functionsDavid Howells
Netfslib has two functions for updating the i_size after a write: one for buffered writes into the pagecache and one for direct/unbuffered writes. However, what needs to be done is much the same in both cases, so merge them together. This does raise one question, though: should updating the i_size after a direct write do the same estimated update of i_blocks as is done for buffered writes? Also get rid of the cleanup function pointer from netfs_io_request as it's only used for direct write to update i_size; instead do the i_size setting directly from write collection. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-12-dhowells@redhat.com cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01netfs: Fix i_size updatingDavid Howells
Fix the updating of i_size, particularly in regard to the completion of DIO writes and especially async DIO writes by using a lock. The bug is triggered occasionally by the generic/207 xfstest as it chucks a bunch of AIO DIO writes at the filesystem and then checks that fstat() returns a reasonable st_size as each completes. The problem is that netfs is trying to do "if new_size > inode->i_size, update inode->i_size" sort of thing but without a lock around it. This can be seen with cifs, but shouldn't be seen with kafs because kafs serialises modification ops on the client whereas cifs sends the requests to the server as they're generated and lets the server order them. Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-11-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
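The "if new_size > inode->i_size, update inode->i_size" pattern from the i_size fix above is a check-then-set that needs to be atomic with respect to other completions. A minimal userspace sketch, assuming a simple spinlock built from C11 `atomic_flag`, is shown here. `struct model_inode` and the `model_*` functions are hypothetical; the kernel patch uses its own inode locking, which this does not reproduce.

```c
#include <stdatomic.h>

/* Hypothetical model of an inode with a size protected by a spinlock. */
struct model_inode {
	atomic_flag lock;   /* stand-in for the lock added by the fix */
	long long i_size;
};

static void model_inode_init(struct model_inode *inode, long long size)
{
	atomic_flag_clear(&inode->lock);
	inode->i_size = size;
}

/*
 * Grow i_size to new_size if new_size is larger. The comparison and the
 * store happen under the lock, so two completions racing with different
 * sizes cannot leave the smaller value in place.
 */
static void model_update_i_size(struct model_inode *inode, long long new_size)
{
	while (atomic_flag_test_and_set_explicit(&inode->lock,
						 memory_order_acquire))
		;  /* spin until the lock is free */

	if (new_size > inode->i_size)
		inode->i_size = new_size;

	atomic_flag_clear_explicit(&inode->lock, memory_order_release);
}
```

Without the lock, two async DIO completions could both read the old size and the one carrying the smaller new size could store last, which is the kind of stale st_size the commit message says generic/207 occasionally caught.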