summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-09-30nfs/localio: refactor iocb and iov_iter_bvec initializationMike Snitzer
nfs_local_iter_init() is updated to follow the same pattern to initializing LOCALIO's iov_iter_bvec as was established by nfsd_iter_read(). Other LOCALIO iocb initialization refactoring in this commit offers incremental cleanup that will be taken further by the next commit. No functional change. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30nfs/localio: avoid issuing misaligned IO using O_DIRECTMike Snitzer
Add nfsd_file_dio_alignment and use it to avoid issuing misaligned IO using O_DIRECT. Any misaligned DIO falls back to using buffered IO. Because misaligned DIO is now handled safely, remove the nfs modparam 'localio_O_DIRECT_semantics' that was added to require users opt-in to the requirement that all O_DIRECT be properly DIO-aligned. Also, introduce nfs_iov_iter_aligned_bvec() which is a variant of iov_iter_aligned_bvec() that also verifies the offset associated with an iov_iter is DIO-aligned. NOTE: in a parallel effort, iov_iter_aligned_bvec() is being removed along with iov_iter_is_aligned(). Lastly, add pr_info_ratelimited if underlying filesystem returns -EINVAL because it was made to try O_DIRECT for IO that is not DIO-aligned (shouldn't happen, so its best to be louder if it does). Fixes: 3feec68563d ("nfs/localio: add direct IO enablement with sync and async IO support") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30nfs/localio: make trace_nfs_local_open_fh more usefulMike Snitzer
Always trigger trace event when LOCALIO opens a file. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN supportMike Snitzer
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment attributes from the underlying filesystem and store them in the associated nfsd_file. This is done when the nfsd_file is first opened for each regular file. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-30Merge tag 'for-6.18-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "There are no new features, the changes are in the core code, notably tree-log error handling and reporting improvements, and initial support for block size > page size. Performance improvements: - search data checksums in the commit root (previous transaction) to avoid locking contention, this improves parallelism of read heavy/low write workloads, and also reduces transaction commit time; on real and reproducer workload the sync time went from minutes to tens of seconds (workload and numbers are in the changelog) Core: - tree-log updates: - error handling improvements, transaction aborts - add new error state 'O' (printed in status messages) when log replay fails and is aborted - reduced number of btrfs_path allocations when traversing the tree - 'block size > page size' support - basic implementation with limitations, under experimental build - limitations: no direct io, raid56, encoded read (standalone and in send ioctl), encoded write - preparatory work for compression, removing implicit assumptions of page and block sizes - compression workspaces are now per-filesystem, we cannot assume common block size for work memory among different filesystems - tree-checker now verifies INODE_EXTREF item (which is implementing hardlinks) - tree leaf pretty printer updates, there were missing data from items, keys/items - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG, it's a debugging feature and not needed to be enabled separately - more struct btrfs_path auto free updates - use ref_tracker API for tracking delayed inodes, enabled by mount option 'ref_verify', allowing to better pinpoint leaking references - in zoned mode, avoid selecting data relocation zoned for ordinary data block groups - updated and enhanced error messages - lots of cleanups and refactoring" * tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits) btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot() btrfs: add unlikely annotations to branches leading to transaction abort btrfs: add unlikely annotations to branches leading to EIO btrfs: add unlikely annotations to branches leading to EUCLEAN btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions btrfs: zoned: don't fail mount needlessly due to too many active zones btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc() btrfs: enable experimental bs > ps support btrfs: add extra ASSERT()s to catch unaligned bios btrfs: fix symbolic link reading when bs > ps btrfs: prepare scrub to support bs > ps cases btrfs: prepare zlib to support bs > ps cases btrfs: prepare lzo to support bs > ps cases btrfs: prepare zstd to support bs > ps cases btrfs: prepare compression folio alloc/free for bs > ps cases btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range() btrfs: remove pointless key offset setup in create_pending_snapshot() btrfs: annotate btrfs_is_testing() as unlikely and make it return bool btrfs: make the rule checking more readable for should_cow_block() btrfs: simplify inline extent end calculation at replay_one_extent() ...
2025-09-30fs/orangefs: Replace kzalloc + copy_from_user with memdup_user_nulThorsten Blum
Replace kzalloc() followed by copy_from_user() with memdup_user_nul() to simplify and improve orangefs_debug_write(). Allocate only 'count' bytes instead of the maximum size ORANGEFS_MAX_DEBUG_STRING_LEN, and set 'buf' to NULL to ensure kfree(buf) still works. No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2025-09-30orangefs: fix xattr related buffer overflow...Mike Marshall
Willy Tarreau <w@1wt.eu> forwarded me a message from Disclosure <disclosure@aisle.com> with the following warning: > The helper `xattr_key()` uses the pointer variable in the loop condition > rather than dereferencing it. As `key` is incremented, it remains non-NULL > (until it runs into unmapped memory), so the loop does not terminate on > valid C strings and will walk memory indefinitely, consuming CPU or hanging > the thread. I easily reproduced this with setfattr and getfattr, causing a kernel oops, hung user processes and corrupted orangefs files. Disclosure sent along a diff (not a patch) with a suggested fix, which I based this patch on. After xattr_key started working right, xfstest generic/069 exposed an xattr related memory leak that lead to OOM. xattr_key returns a hashed key. When adding xattrs to the orangefs xattr cache, orangefs used hash_add, a kernel hashing macro. hash_add also hashes the key using hash_log which resulted in additions to the xattr cache going to the wrong hash bucket. generic/069 tortures a single file and orangefs does a getattr for the xattr "security.capability" every time. Orangefs negative caches on xattrs which includes a kmalloc. Since adds to the xattr cache were going to the wrong bucket, every getattr for "security.capability" resulted in another kmalloc, none of which were ever freed. I changed the two uses of hash_add to hlist_add_head instead and the memory leak ceased and generic/069 quit throwing furniture. Signed-off-by: Mike Marshall <hubcap@omnibond.com> Reported-by: Stanislav Fort of Aisle Research <stanislav.fort@aisle.com>
2025-09-30orangefs: Remove unused type in macro fill_default_sys_attrsZhen Ni
Remove the unused type parameter from the macro definition and all its callers, making the interface consistent with its actual usage. Fixes: 69a23de2f3de ("orangefs: clean up fill_default_sys_attrs") Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2025-09-30exfat: Add support for FS_IOC_{GET,SET}FSLABELEthan Ferguson
Add support for reading / writing to the exfat volume label from the FS_IOC_GETFSLABEL and FS_IOC_SETFSLABEL ioctls Co-developed-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Ethan Ferguson <ethan.ferguson@zetier.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30exfat: combine iocharset and utf8 option setupSang-Heon Jeon
Currently, exfat utf8 mount option depends on the iocharset option value. After exfat remount, utf8 option may become inconsistent with iocharset option. If the options are inconsistent; (specifically, iocharset=utf8 but utf8=0) readdir may reference uninitalized NLS, leading to a null pointer dereference. Extract and combine utf8/iocharset setup logic into exfat_set_iocharset(). Then Replace iocharset setup logic to exfat_set_iocharset to prevent utf8/iocharset option inconsistentcy after remount. Reported-by: syzbot+3e9cb93e3c5f90d28e19@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3e9cb93e3c5f90d28e19 Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com> Fixes: acab02ffcd6b ("exfat: support modifying mount options via remount") Tested-by: syzbot+3e9cb93e3c5f90d28e19@syzkaller.appspotmail.com Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30exfat: support modifying mount options via remountYuezhang Mo
Before this commit, all exfat-defined mount options could not be modified dynamically via remount, and no error was returned. After this commit, these three exfat-defined mount options (discard, zero_size_dir, and errors) can be modified dynamically via remount. While other exfat-defined mount options cannot be modified dynamically via remount because their old settings are cached in inodes or dentries, modifying them will be rejected with an error. Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30exfat: optimize allocation bitmap loading timeNamjae Jeon
Loading the allocation bitmap is very slow if user set the small cluster size on large partition. For optimizing it, This patch uses sb_breadahead() read the allocation bitmap. It will improve the mount time. The following is the result of about 4TB partition(2KB cluster size) on my target. without patch: real 0m41.746s user 0m0.011s sys 0m0.000s with patch: real 0m2.525s user 0m0.008s sys 0m0.008s Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30exfat: Remove unnecessary parenthesesLiao Yuanhong
When using &, it's unnecessary to have parentheses afterward. Remove redundant parentheses to enhance readability. Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30exfat: drop redundant conversion to boolXichao Zhao
The result of integer comparison already evaluates to bool. No need for explicit conversion. No functional impact. Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30exfat: validate cluster allocation bits of the allocation bitmapNamjae Jeon
syzbot created an exfat image with cluster bits not set for the allocation bitmap. exfat-fs reads and uses the allocation bitmap without checking this. The problem is that if the start cluster of the allocation bitmap is 6, cluster 6 can be allocated when creating a directory with mkdir. exfat zeros out this cluster in exfat_mkdir, which can delete existing entries. This can reallocate the allocated entries. In addition, the allocation bitmap is also zeroed out, so cluster 6 can be reallocated. This patch adds exfat_test_bitmap_range to validate that clusters used for the allocation bitmap are correctly marked as in-use. Reported-by: syzbot+a725ab460fc1def9896f@syzkaller.appspotmail.com Tested-by: syzbot+a725ab460fc1def9896f@syzkaller.appspotmail.com Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-30exfat: limit log print for IO errorChi Zhiling
For exFAT filesystems with 4MB read_ahead_size, removing the storage device when the read operation is in progress, which cause the last read syscall spent 150s [1]. The main reason is that exFAT generates excessive log messages [2]. After applying this patch, approximately 300,000 lines of log messages were suppressed, and the delay of the last read() syscall was reduced to about 4 seconds. [1]: write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000120> read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000032> write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000119> read(4, 0x7fccf28ae000, 131072) = -1 EIO (Input/output error) <150.186215> [2]: [ 333.696603] exFAT-fs (vdb): error, failed to access to FAT (entry 0x0000d780, err:-5) [ 333.697378] exFAT-fs (vdb): error, failed to access to FAT (entry 0x0000d780, err:-5) [ 333.698156] exFAT-fs (vdb): error, failed to access to FAT (entry 0x0000d780, err:-5) Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2025-09-29smb: client: fix crypto buffers in non-linear memoryEnzo Matsumiya
The crypto API, through the scatterlist API, expects input buffers to be in linear memory. We handle this with the cifs_sg_set_buf() helper that converts vmalloc'd memory to their corresponding pages. However, when we allocate our aead_request buffer (@creq in smb2ops.c::crypt_message()), we do so with kvzalloc(), which possibly puts aead_request->__ctx in vmalloc area. AEAD algorithm then uses ->__ctx for its private/internal data and operations, and uses sg_set_buf() for such data on a few places. This works fine as long as @creq falls into kmalloc zone (small requests) or vmalloc'd memory is still within linear range. Tasks' stacks are vmalloc'd by default (CONFIG_VMAP_STACK=y), so too many tasks will increment the base stacks' addresses to a point where virt_addr_valid(buf) will fail (BUG() in sg_set_buf()) when that happens. In practice: too many parallel reads and writes on an encrypted mount will trigger this bug. To fix this, always alloc @creq with kmalloc() instead. Also drop the @sensitive_size variable/arguments since kfree_sensitive() doesn't need it. Backtrace: [ 945.272081] ------------[ cut here ]------------ [ 945.272774] kernel BUG at include/linux/scatterlist.h:209! [ 945.273520] Oops: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI [ 945.274412] CPU: 7 UID: 0 PID: 56 Comm: kworker/u33:0 Kdump: loaded Not tainted 6.15.0-lku-11779-g8e9d6efccdd7-dirty #1 PREEMPT(voluntary) [ 945.275736] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-2-gc13ff2cd-prebuilt.qemu.org 04/01/2014 [ 945.276877] Workqueue: writeback wb_workfn (flush-cifs-2) [ 945.277457] RIP: 0010:crypto_gcm_init_common+0x1f9/0x220 [ 945.278018] Code: b0 00 00 00 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 48 c7 c0 00 00 00 80 48 2b 05 5c 58 e5 00 e9 58 ff ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 48 c7 04 24 01 00 00 00 48 8b [ 945.279992] RSP: 0018:ffffc90000a27360 EFLAGS: 00010246 [ 945.280578] RAX: 0000000000000000 RBX: ffffc90001d85060 RCX: 0000000000000030 [ 945.281376] RDX: 0000000000080000 RSI: 0000000000000000 RDI: ffffc90081d85070 [ 945.282145] RBP: ffffc90001d85010 R08: ffffc90001d85000 R09: 0000000000000000 [ 945.282898] R10: ffffc90001d85090 R11: 0000000000001000 R12: ffffc90001d85070 [ 945.283656] R13: ffff888113522948 R14: ffffc90001d85060 R15: ffffc90001d85010 [ 945.284407] FS: 0000000000000000(0000) GS:ffff8882e66cf000(0000) knlGS:0000000000000000 [ 945.285262] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 945.285884] CR2: 00007fa7ffdd31f4 CR3: 000000010540d000 CR4: 0000000000350ef0 [ 945.286683] Call Trace: [ 945.286952] <TASK> [ 945.287184] ? crypt_message+0x33f/0xad0 [cifs] [ 945.287719] crypto_gcm_encrypt+0x36/0xe0 [ 945.288152] crypt_message+0x54a/0xad0 [cifs] [ 945.288724] smb3_init_transform_rq+0x277/0x300 [cifs] [ 945.289300] smb_send_rqst+0xa3/0x160 [cifs] [ 945.289944] cifs_call_async+0x178/0x340 [cifs] [ 945.290514] ? __pfx_smb2_writev_callback+0x10/0x10 [cifs] [ 945.291177] smb2_async_writev+0x3e3/0x670 [cifs] [ 945.291759] ? find_held_lock+0x32/0x90 [ 945.292212] ? netfs_advance_write+0xf2/0x310 [ 945.292723] netfs_advance_write+0xf2/0x310 [ 945.293210] netfs_write_folio+0x346/0xcc0 [ 945.293689] ? __pfx__raw_spin_unlock_irq+0x10/0x10 [ 945.294250] netfs_writepages+0x117/0x460 [ 945.294724] do_writepages+0xbe/0x170 [ 945.295152] ? find_held_lock+0x32/0x90 [ 945.295600] ? kvm_sched_clock_read+0x11/0x20 [ 945.296103] __writeback_single_inode+0x56/0x4b0 [ 945.296643] writeback_sb_inodes+0x229/0x550 [ 945.297140] __writeback_inodes_wb+0x4c/0xe0 [ 945.297642] wb_writeback+0x2f1/0x3f0 [ 945.298069] wb_workfn+0x300/0x490 [ 945.298472] process_one_work+0x1fe/0x590 [ 945.298949] worker_thread+0x1ce/0x3c0 [ 945.299397] ? __pfx_worker_thread+0x10/0x10 [ 945.299900] kthread+0x119/0x210 [ 945.300285] ? __pfx_kthread+0x10/0x10 [ 945.300729] ret_from_fork+0x119/0x1b0 [ 945.301163] ? __pfx_kthread+0x10/0x10 [ 945.301601] ret_from_fork_asm+0x1a/0x30 [ 945.302055] </TASK> Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list") Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-29smb: Use arc4 library instead of duplicate arc4 codeEric Biggers
fs/smb/common/cifs_arc4.c has an implementation of ARC4, but a copy of this same code is also present in lib/crypto/arc4.c to serve the other users of this legacy algorithm in the kernel. Remove the duplicate implementation in fs/smb/, which seems to have been added because of a misunderstanding, and just use the lib/crypto/ one. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-29smb: client: add tcon information to smb2_reconnect() debug messagesHenrique Carvalho
smb2_reconnect() debug messages lack tcon context, making it hard to identify which tcon is reconnecting in multi-share environments. Change cifs_dbg() to cifs_tcon_dbg() to include tcon information. Closes: https://bugzilla.suse.com/show_bug.cgi?id=1234066 Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-29Merge tag 'pstore-v6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull tiny pstore update from Kees Cook: - Clarify various comments for better understanding (Eugen Hristev) * tag 'pstore-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/zone: rewrite some comments for better understanding
2025-09-29Merge tag 'execve-v6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - binfmt_elf: preserve original ELF e_flags for core dumps (Svetlana Parfenova) - exec: Fix incorrect type for ret (Xichao Zhao) - binfmt_elf: Replace offsetof() with struct_size() in fill_note_info() (Xichao Zhao) * tag 'execve-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf: preserve original ELF e_flags for core dumps binfmt_elf: Replace offsetof() with struct_size() in fill_note_info() exec: Fix incorrect type for ret
2025-09-29Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linuxLinus Torvalds
Pull interleaved SHA-256 hashing support from Eric Biggers: "Optimize fsverity with 2-way interleaved hashing Add support for 2-way interleaved SHA-256 hashing to lib/crypto/, and make fsverity use it for faster file data verification. This improves fsverity performance on many x86_64 and arm64 processors. Later, I plan to make dm-verity use this too" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: Use 2-way interleaved SHA-256 hashing when supported fsverity: Remove inode parameter from fsverity_hash_block() lib/crypto: tests: Add tests and benchmark for sha256_finup_2x() lib/crypto: x86/sha256: Add support for 2-way interleaved hashing lib/crypto: arm64/sha256: Add support for 2-way interleaved hashing lib/crypto: sha256: Add support for 2-way interleaved hashing
2025-09-29Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds
Pull fscrypt updates from Eric Biggers: "Make fs/crypto/ use the HMAC-SHA512 library functions instead of crypto_shash. This is simpler, faster, and more reliable" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: use HMAC-SHA512 library for HKDF fscrypt: Remove redundant __GFP_NOWARN
2025-09-29Merge tag 'dlm-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This adds a dlm_release_lockspace() flag to request that node-failure recovery be performed for the node leaving the lockspace. The implementation of this flag requires coordination with userland clustering components. It's been requested for use by GFS2" * tag 'dlm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: check for undefined release_option values dlm: handle release_option as unsigned dlm: move to rinfo for all middle conversion cases dlm: handle invalid lockspace member remove dlm: add new flag DLM_RELEASE_RECOVER for dlm_lockspace_release dlm: add new configfs entry release_recover for lockspace members dlm: add new RELEASE_RECOVER uevent attribute for release_lockspace dlm: use defines for force values in dlm_release_lockspace dlm: check for defined force value in dlm_lockspace_release
2025-09-29Merge tag 'erofs-for-6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: - Support FS_IOC_GETFSLABEL ioctl - Stop meaningless read-more policy on fragment extents - Remove a duplicate check for ztailpacking inline * tag 'erofs-for-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: drop redundant sanity check for ztailpacking inline erofs: Add support for FS_IOC_GETFSLABEL erofs: avoid reading more for fragment maps
2025-09-29Merge tag 'hfs-v6.18-tag1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs Pull hfs updates from Viacheslav Dubeyko: "This contains several fixes of syzbot reported issues, HFS/HFS+ fixes of xfstests failures, and rework of HFS/HFS+ debug output subsystem. - Kang Chen fixed a slab-out-of-bounds issue in hfsplus_uni2asc() when hfsplus_uni2asc() is called from hfsplus_listxattr(). - Yang Chenzhi fixed a crash in hfsplus_bmap_alloc() if record offset or length is larger than node_size. - Yangtao Li corrected the error code from hfsplus_fill_super() if Catalog File contains corrupted record for the case of hidden directory's type. - KMSAN uninit-value fixes: hfs_find_set_zero_bits() and __hfsplus_ext_cache_extent() use kzalloc() instead of kmalloc(), and in hfsplus_delete_cat() by proper initialization of struct hfsplus_inode_info in the hfsplus_iget() logic. - A slab-out-of-bounds issue could happen in hfsplus_strcasecmp() if the length field of struct hfsplus_unistr is bigger than HFSPLUS_MAX_STRLEN. Fixed by checking the length of comparing strings, and if the strings' length is bigger than HFSPLUS_MAX_STRLEN, then the length is corrected to this value. - The generic/736 xfstest failed for HFS because the HFS volume becomes corrupted after the test run. The main reason was the absence of logic that corrects mdb->drNxtCNID/HFS_SB(sb)->next_id (next unused CNID) after deleting a record in Catalog File. That was fixed by implementing the necessary logic in hfs_correct_next_unused_CNID()" * tag 'hfs-v6.18-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/vdubeyko/hfs: hfs/hfsplus: rework debug output subsystem hfsplus: fix slab-out-of-bounds read in hfsplus_strcasecmp() hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc() hfs: clear offset and space out of valid records in b-tree node hfs: add logic of correcting a next unused CNID hfsplus: fix KMSAN uninit-value issue in hfsplus_delete_cat() hfs: fix KMSAN uninit-value issue in hfs_find_set_zero_bits() hfs: make proper initalization of struct hfs_find_data hfsplus: fix KMSAN uninit-value issue in __hfsplus_ext_cache_extent() hfs: validate record offset in hfsplus_bmap_alloc hfsplus: return EIO when type of hidden directory mismatch in hfsplus_fill_super() MAINTAINERS: update location of hfs&hfsplus trees
2025-09-29Merge tag 'v6.18-rc-part1-smb3-common' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb restructuring updates from Steve French: "Large set of small restructuring smbdirect related patches for cifs.ko and ksmbd.ko. This is the next step in order to use common structures for smbdirect handling across both modules. And also includes improved handling of broken connections, as well as fixed negotiation as rdma resources. Moving to common functions is planned for 6.19, as well as also providing smbdirect via sockets to userspace (e.g. for samba to also be able to use smbdirect for userspace server and userspace client tools). This was heavily reviewed and tested at the recent SMB3.1.1 test event at SDC" * tag 'v6.18-rc-part1-smb3-common' of git://git.samba.org/ksmbd: (159 commits) smb: server: let smb_direct_flush_send_list() invalidate a remote key first smb: server: make use of ib_alloc_cq_any() instead of ib_alloc_cq() smb: server: make consitent use of spin_lock_irq{save,restore}() in transport_rdma.c smb: server: let {free_transport,smb_direct_disconnect_rdma_{work,connection}}() wake up all wait queues smb: server: let smb_direct_disconnect_rdma_connection() disable all work but disconnect_work smb: server: fill in smbdirect_socket.first_error on error smb: server: let smb_direct_disconnect_rdma_connection() set SMBDIRECT_SOCKET_ERROR... smb: server: pass struct smbdirect_socket to smb_direct_send_negotiate_response() smb: server: pass struct smbdirect_socket to {enqueue,get_first}_reassembly() smb: server: pass struct smbdirect_socket to smb_direct_post_send_data() smb: server: pass struct smbdirect_socket to post_sendmsg() smb: server: pass struct smbdirect_socket to smb_direct_create_header() smb: server: pass struct smbdirect_socket to manage_keep_alive_before_sending() smb: server: pass struct smbdirect_socket to manage_credits_prior_sending() smb: server: pass struct smbdirect_socket to calc_rw_credits() smb: server: pass struct smbdirect_socket to wait_for_rw_credits() smb: server: pass struct smbdirect_socket to wait_for_send_credits() smb: server: pass struct smbdirect_socket to wait_for_credits() smb: server: pass struct smbdirect_socket to smb_direct_flush_send_list() smb: server: pass struct smbdirect_socket to smb_direct_post_send() ...
2025-09-29Merge tag 'xfs-merge-6.18' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: "For this merge window, there are really no new features, but there are a few things worth to emphasize: - Deprecated for years already, the (no)attr2 and (no)ikeep mount options have been removed for good - Several cleanups (specially from typedefs) and bug fixes - Improvements made in the online repair reap calculations - online fsck is now enabled by default" * tag 'xfs-merge-6.18' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (53 commits) xfs: rework datasync tracking and execution xfs: rearrange code in xfs_inode_item_precommit xfs: scrub: use kstrdup_const() for metapath scan setups xfs: use bt_nr_sectors in xfs_dax_translate_range xfs: track the number of blocks in each buftarg xfs: constify xfs_errortag_random_default xfs: improve default maximum number of open zones xfs: improve zone statistics message xfs: centralize error tag definitions xfs: remove pointless externs in xfs_error.h xfs: remove the expr argument to XFS_TEST_ERROR xfs: remove xfs_errortag_set xfs: remove xfs_errortag_get xfs: move the XLOG_REG_ constants out of xfs_log_format.h xfs: adjust the hint based zone allocation policy xfs: refactor hint based zone allocation fs: add an enum for number of life time hints xfs: fix log CRC mismatches between i386 and other architectures xfs: rename the old_crc variable in xlog_recover_process xfs: remove the unused xfs_log_iovec_t typedef ...
2025-09-29Merge tag 'gfs2-for-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Partially revert "gfs2: do_xmote fixes" to ignore dlm_lock() errors during withdraw; passing on those errors doesn't help - Change the LM_FLAG_TRY and LM_FLAG_TRY_1CB logic in add_to_queue() to check if the holder would actually block - Move some more dlm specific code from glock.c to lock_dlm.c - Remove the unused dlm alternate locking mode code - Add proper locking to make sure that dlm lockspaces are never used after being released - Various other cleanups * tag 'gfs2-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix unlikely race in gdlm_put_lock gfs2: Add proper lockspace locking gfs2: Minor run_queue fixes gfs2: run_queue cleanup gfs2: Simplify do_promote gfs2: Get rid of GLF_INVALIDATE_IN_PROGRESS gfs2: Fix GLF_INVALIDATE_IN_PROGRESS flag clearing in do_xmote gfs2: Remove duplicate check in do_xmote gfs2: Fix LM_FLAG_TRY* logic in add_to_queue gfs2: Remove DLM_LKF_ALTCW / DLM_LKF_ALTPR code gfs2: Further sanitize lock_dlm.c gfs2: Do not use atomic operations unnecessarily gfs2: Sanitize gfs2_meta_check, gfs2_metatype_check, gfs2_io_error gfs2: Turn gfs2_withdraw into a void function gfs2: Partially revert "gfs2: do_xmote fixes" gfs2: Simplify refcounting in do_xmote gfs2: do_xmote cleanup gfs2: Remove space before newline gfs2: Remove unused sd_withdraw_wait field gfs2: Remove unused GIF_FREE_VFS_INODE flag
2025-09-29Remove bcachefs core codeLinus Torvalds
bcachefs was marked 'externally maintained' in 6.17 but the code remained to make the transition smoother. It's now a DKMS module, making the in-kernel code stale, so remove it to avoid any version confusion. Link: https://lore.kernel.org/linux-bcachefs/yokpt2d2g2lluyomtqrdvmkl3amv3kgnipmenobkpgx537kay7@xgcgjviv3n7x/T/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-09-29Merge tag 'vfs-6.18-rc1.async' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs async directory updates from Christian Brauner: "This contains further preparatory changes for the asynchronous directory locking scheme: - Add lookup_one_positive_killable() which allows overlayfs to perform lookup that won't block on a fatal signal - Unify the mount idmap handling in struct renamedata as a rename can only happen within a single mount - Introduce kern_path_parent() for audit which sets the path to the parent and returns a dentry for the target without holding any locks on return - Rename kern_path_locked() as it is only used to prepare for the removal of an object from the filesystem: kern_path_locked() => start_removing_path() kern_path_create() => start_creating_path() user_path_create() => start_creating_user_path() user_path_locked_at() => start_removing_user_path_at() done_path_create() => end_creating_path() NA => end_removing_path()" * tag 'vfs-6.18-rc1.async' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: debugfs: rename start_creating() to debugfs_start_creating() VFS: rename kern_path_locked() and related functions. VFS/audit: introduce kern_path_parent() for audit VFS: unify old_mnt_idmap and new_mnt_idmap in renamedata VFS: discard err2 in filename_create() VFS/ovl: add lookup_one_positive_killable()
2025-09-29Merge tag 'vfs-6.18-rc1.writeback' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs writeback updates from Christian Brauner: "This contains work adressing lockups reported by users when a systemd unit reading lots of files from a filesystem mounted with the lazytime mount option exits. With the lazytime mount option enabled we can be switching many dirty inodes on cgroup exit to the parent cgroup. The numbers observed in practice when systemd slice of a large cron job exits can easily reach hundreds of thousands or millions. The logic in inode_do_switch_wbs() which sorts the inode into appropriate place in b_dirty list of the target wb however has linear complexity in the number of dirty inodes thus overall time complexity of switching all the inodes is quadratic leading to workers being pegged for hours consuming 100% of the CPU and switching inodes to the parent wb. Simple reproducer of the issue: FILES=10000 # Filesystem mounted with lazytime mount option MNT=/mnt/ echo "Creating files and switching timestamps" for (( j = 0; j < 50; j ++ )); do mkdir $MNT/dir$j for (( i = 0; i < $FILES; i++ )); do echo "foo" >$MNT/dir$j/file$i done touch -a -t 202501010000 $MNT/dir$j/file* done wait echo "Syncing and flushing" sync echo 3 >/proc/sys/vm/drop_caches echo "Reading all files from a cgroup" mkdir /sys/fs/cgroup/unified/mycg1 || exit echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit for (( j = 0; j < 50; j ++ )); do cat /mnt/dir$j/file* >/dev/null & done wait echo "Switching wbs" # Now rmdir the cgroup after the script exits This can be solved by: - Avoiding contention on the wb->list_lock when switching inodes by running a single work item per wb and managing a queue of items switching to the wb - Allowing rescheduling when switching inodes over to a different cgroup to avoid softlockups - Maintaining the b_dirty list ordering instead of sorting it" * tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: writeback: Add tracepoint to track pending inode switches writeback: Avoid excessively long inode switching times writeback: Avoid softlockup when switching many inodes writeback: Avoid contention on wb->list_lock when switching inodes
2025-09-29Merge tag 'namespace-6.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains a larger set of changes around the generic namespace infrastructure of the kernel. Each specific namespace type (net, cgroup, mnt, ...) embedds a struct ns_common which carries the reference count of the namespace and so on. We open-coded and cargo-culted so many quirks for each namespace type that it just wasn't scalable anymore. So given there's a bunch of new changes coming in that area I've started cleaning all of this up. The core change is to make it possible to correctly initialize every namespace uniformly and derive the correct initialization settings from the type of the namespace such as namespace operations, namespace type and so on. This leaves the new ns_common_init() function with a single parameter which is the specific namespace type which derives the correct parameters statically. This also means the compiler will yell as soon as someone does something remotely fishy. The ns_common_init() addition also allows us to remove ns_alloc_inum() and drops any special-casing of the initial network namespace in the network namespace initialization code that Linus complained about. Another part is reworking the reference counting. The reference counting was open-coded and copy-pasted for each namespace type even though they all followed the same rules. This also removes all open accesses to the reference count and makes it private and only uses a very small set of dedicated helpers to manipulate them just like we do for e.g., files. In addition this generalizes the mount namespace iteration infrastructure introduced a few cycles ago. As reminder, the vfs makes it possible to iterate sequentially and bidirectionally through all mount namespaces on the system or all mount namespaces that the caller holds privilege over. This allow userspace to iterate over all mounts in all mount namespaces using the listmount() and statmount() system call. Each mount namespace has a unique identifier for the lifetime of the systems that is exposed to userspace. The network namespace also has a unique identifier working exactly the same way. This extends the concept to all other namespace types. The new nstree type makes it possible to lookup namespaces purely by their identifier and to walk the namespace list sequentially and bidirectionally for all namespace types, allowing userspace to iterate through all namespaces. Looking up namespaces in the namespace tree works completely locklessly. This also means we can move the mount namespace onto the generic infrastructure and remove a bunch of code and members from struct mnt_namespace itself. There's a bunch of stuff coming on top of this in the future but for now this uses the generic namespace tree to extend a concept introduced first for pidfs a few cycles ago. For a while now we have supported pidfs file handles for pidfds. This has proven to be very useful. This extends the concept to cover namespaces as well. It is possible to encode and decode namespace file handles using the common name_to_handle_at() and open_by_handle_at() apis. As with pidfs file handles, namespace file handles are exhaustive, meaning it is not required to actually hold a reference to nsfs in able to decode aka open_by_handle_at() a namespace file handle. Instead the FD_NSFS_ROOT constant can be passed which will let the kernel grab a reference to the root of nsfs internally and thus decode the file handle. Namespaces file descriptors can already be derived from pidfds which means they aren't subject to overmount protection bugs. IOW, it's irrelevant if the caller would not have access to an appropriate /proc/<pid>/ns/ directory as they could always just derive the namespace based on a pidfd already. It has the same advantage as pidfds. It's possible to reliably and for the lifetime of the system refer to a namespace without pinning any resources and to compare them trivially. Permission checking is kept simple. If the caller is located in the namespace the file handle refers to they are able to open it otherwise they must hold privilege over the owning namespace of the relevant namespace. The namespace file handle layout is exposed as uapi and has a stable and extensible format. For now it simply contains the namespace identifier, the namespace type, and the inode number. The stable format means that userspace may construct its own namespace file handles without going through name_to_handle_at() as they are already allowed for pidfs and cgroup file handles" * tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits) ns: drop assert ns: move ns type into struct ns_common nstree: make struct ns_tree private ns: add ns_debug() ns: simplify ns_common_init() further cgroup: add missing ns_common include ns: use inode initializer for initial namespaces selftests/namespaces: verify initial namespace inode numbers ns: rename to __ns_ref nsfs: port to ns_ref_*() helpers net: port to ns_ref_*() helpers uts: port to ns_ref_*() helpers ipv4: use check_net() net: use check_net() net-sysfs: use check_net() user: port to ns_ref_*() helpers time: port to ns_ref_*() helpers pid: port to ns_ref_*() helpers ipc: port to ns_ref_*() helpers cgroup: port to ns_ref_*() helpers ...
2025-09-29Merge tag 'vfs-6.18-rc1.afs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull afs updates from Christian Brauner: "This contains the change to enable afs to support RENAME_NOREPLACE and RENAME_EXCHANGE if the server supports it" * tag 'vfs-6.18-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: afs: Add support for RENAME_NOREPLACE and RENAME_EXCHANGE
2025-09-29Merge tag 'kernel-6.18-rc1.clone3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_process updates from Christian Brauner: "This contains the changes to enable support for clone3() on nios2 which apparently is still a thing. The more exciting part of this is that it cleans up the inconsistency in how the 64-bit flag argument is passed from copy_process() into the various other copy_*() helpers" [ Fixed up rv ltl_monitor 32-bit support as per Sasha Levin in the merge ] * tag 'kernel-6.18-rc1.clone3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nios2: implement architecture-specific portion of sys_clone3 arch: copy_thread: pass clone_flags as u64 copy_process: pass clone_flags as u64 across calltree copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)
2025-09-29Merge tag 'vfs-6.18-rc1.workqueue' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29Merge tag 'vfs-6.18-rc1.pidfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: "This just contains a few changes to pid_nr_ns() to make it more robust and cleans up or improves a few users that ab- or misuse it currently" * tag 'vfs-6.18-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pid: change task_state() to use task_ppid_nr_ns() pid: change bacct_add_tsk() to use task_ppid_nr_ns() pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers pid: Add a judgment for ns null in pid_nr_ns
2025-09-29Merge tag 'vfs-6.18-rc1.iomap' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs iomap updates from Christian Brauner: "This contains minor fixes and cleanups to the iomap code. Nothing really stands out here" * tag 'vfs-6.18-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: error out on file IO when there is no inline_data buffer iomap: trace iomap_zero_iter zeroing activities
2025-09-29Merge tag 'vfs-6.18-rc1.inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
2025-09-29Merge tag 'vfs-6.18-rc1.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains some work around mount api handling: - Output the warning message for mnt_too_revealing() triggered during fsmount() to the fscontext log. This makes it possible for the mount tool to output appropriate warnings on the command line. For example, with the newest fsopen()-based mount(8) from util-linux, the error messages now look like: # mount -t proc proc /tmp mount: /tmp: fsmount() failed: VFS: Mount too revealing. dmesg(1) may have more information after failed mount system call. - Do not consume fscontext log entries when returning -EMSGSIZE Userspace generally expects APIs that return -EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the -EMSGSIZE case. Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead. - Drop an unused argument from do_remount()" * tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: fs/namespace.c: remove ms_flags argument from do_remount selftests/filesystems: add basic fscontext log tests fscontext: do not consume log entries when returning -EMSGSIZE vfs: output mount_too_revealing() errors to fscontext docs/vfs: Remove mentions to the old mount API helpers fscontext: add custom-prefix log helpers fs: Remove mount_bdev fs: Remove mount_nodev
2025-09-29mount: handle NULL values in mnt_ns_release()Christian Brauner
When calling in listmount() mnt_ns_release() may be passed a NULL pointer. Handle that case gracefully. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-09-29Merge tag 'vfs-6.18-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-28smb: server: let smb_direct_flush_send_list() invalidate a remote key firstStefan Metzmacher
If we want to invalidate a remote key we should do that as soon as possible, so do it in the first send work request. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-28smb: server: make use of ib_alloc_cq_any() instead of ib_alloc_cq()Stefan Metzmacher
commit 20cf4e026730 ("rdma: Enable ib_alloc_cq to spread work over a device's comp_vectors") happened before ksmbd was upstreamed, but after the out of tree ksmbd (a.k.a. cifsd) was developed. So we still used ib_alloc_cq(). Acked-by: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-28smb: server: make consitent use of spin_lock_irq{save,restore}() in ↵Stefan Metzmacher
transport_rdma.c There is a mix of using spin_lock() and spin_lock_irq(), which is confusing as IB_POLL_WORKQUEUE is used and no code would be called from any interrupt. So using spin_lock() or even mutexes would be ok. But we'll soon share common code with the client, which uses IB_POLL_SOFTIRQ. And Documentation/kernel-hacking/locking.rst section "Cheat Sheet For Locking" says: - Otherwise (== data can be touched in an interrupt), use spin_lock_irqsave() and spin_unlock_irqrestore(). So in order to keep it simple and safe we use that version now. It will help merging functions into common code and have consistent locking in all cases. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-28smb: server: let ↵Stefan Metzmacher
{free_transport,smb_direct_disconnect_rdma_{work,connection}}() wake up all wait queues This is important in order to let all waiters notice a broken connection. We also go via smb_direct_disconnect_rdma_{work,connection}() for broken connections. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-28smb: server: let smb_direct_disconnect_rdma_connection() disable all work ↵Stefan Metzmacher
but disconnect_work There's no point run these if we already know the connection is broken. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-28smb: server: fill in smbdirect_socket.first_error on errorStefan Metzmacher
For now we just use -ECONNABORTED, but it will get more detailed later. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-28smb: server: let smb_direct_disconnect_rdma_connection() set ↵Stefan Metzmacher
SMBDIRECT_SOCKET_ERROR... smb_direct_disconnect_rdma_connection() should turn the status into an error state instead of leaving it as is until smb_direct_disconnect_rdma_work() is running. Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-09-28smb: server: pass struct smbdirect_socket to ↵Stefan Metzmacher
smb_direct_send_negotiate_response() This will make it easier to move function to the common code in future. Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>