linux.git/include/linux/fs.h, branch v7.2-rc1

Merge tag 'mm-stable-2026-06-23-08-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2026-06-23T19:03:44+00:00

Pull more MM updates from Andrew Morton:

 - "khugepaged: add mTHP collapse support" (Nico Pache)

   Provide khugepaged with the capability to collapse anonymous memory
   regions to mTHPs

 - "Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable
   files" (Zi Yan)

   Remove the READ_ONLY_THP_FOR_FS check in file_thp_enabled(), so that
   khugepaged and MADV_COLLAPSE can run on filesystems with PMD THP
   pagecache support even without READ_ONLY_THP_FOR_FS enabled

 - "make MM selftests more CI friendly" (Mike Rapoport)

   General fixes and cleanups to the MM selftests. Also move more MM
   selftests under the kselftest framework, making them more amenable to
   ongoing CI testing

 - "selftests/mm: fix failures and robustness improvements" and
   "selftests/mm: assorted fixes for hmm-tests" (Sayali Patil)

   Fix several issues in MM selftests which were revealed by powerpc 64k
   pagesize

* tag 'mm-stable-2026-06-23-08-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (118 commits)
  Revert "mm: limit filemap_fault readahead to VMA boundaries"
  mm/vmscan: pass NULL to trace vmscan node reclaim
  mm: use mapping_mapped to simplify the code
  selftests/mm: fix exclusive_cow test fork() handling
  selftests/mm: remove hardcoded THP sizing assumptions in hmm tests
  selftests/mm: allow PUD-level entries in compound testcase of hmm tests
  mm/gup_test: reject wrapped user ranges
  mm/page_frag: reject invalid CPUs in page_frag_test
  mm/damon/core: always put unsuccessfully committed target pids
  mm: page_isolation: avoid unsafe folio reads while scanning compound pages
  mm/shrinker: do not hold RCU lock in shrinker_debugfs_count_show()
  selftests: mm: fix and speedup "droppable" test
  mm: merge writeout into pageout
  MAINTAINERS: add Hao Ge as reviewer for codetag and alloc_tag
  selftests/mm: clarify alternate unmapping in compaction_test
  selftests/mm: move hwpoison setup into run_test() and silence modprobe output for memory-failure category
  selftests/mm: skip uffd-stress test when nr_pages_per_cpu is zero
  selftests/mm: skip uffd-wp-mremap if UFFD write-protect is unsupported
  selftests/mm: ensure destination is hugetlb-backed in hugetlb-mremap
  selftest/mm: register existing mapping with userfaultfd in hugetlb-mremap
  ...

fs: remove nr_thps from struct address_space

2026-06-21T18:37:15+00:00

filemap_nr_thps*() are removed, the related field, address_space->nr_thps,
is no longer needed.  Remove it.  This shrinks struct address_space by 8
bytes on 64-bit systems which may increase the number of inodes we can
cache.

Link: https://lore.kernel.org/20260517135416.1434539-8-ziy@nvidia.com
Signed-off-by: Zi Yan 
Reviewed-by: Lorenzo Stoakes (Oracle) 
Acked-by: David Hildenbrand (Arm) 
Reviewed-by: Lance Yang 
Reviewed-by: Matthew Wilcox (Oracle) 
Reviewed-by: Baolin Wang 
Reviewed-by: Nico Pache 
Cc: Al Viro 
Cc: Barry Song 
Cc: Chris Mason 
Cc: Christian Brauner 
Cc: David Sterba 
Cc: Dev Jain 
Cc: Jan Kara 
Cc: Liam Howlett 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Ryan Roberts 
Cc: Shuah Khan 
Cc: Song Liu 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

Merge tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T22:37:58+00:00

Pull procfs updates from Christian Brauner:

 - Revamp fs/filesystems.c

   The file was a mess with a hand-rolled linked list in desperate need
   of a cleanup. The filesystems list is now RCU-ified, /proc files can
   be marked permanent from outside fs/proc/, and the string emitted
   when reading /proc/filesystems is pre-generated and cached instead of
   pointer-chasing and printfing entry by entry on every read.

   The file is read frequently because libselinux reads it and is linked
   into numerous frequently used programs (even ones you would not
   suspect, like sed!). Scalability also improves since reference
   maintenance on open/close is bypassed.

    open+read+close cycle single-threaded (ops/s):
      before: 442732
      after:  1063462 (+140%)

    open+read+close cycle with 20 processes (ops/s):
      before: 606177
      after:  3300576 (+444%)

   A follow-up patch adds missing unlocks in some corner cases and
   tidies things up.

 - Relax the mount visibility check for subset=pid mounts

   When procfs is mounted with subset=pid, all static files become
   unavailable and only the dynamic pid information is accessible. In
   that case there is no point in imposing the full mount visibility
   restrictions on the mounter - everything that can be hidden in procfs
   is already inaccessible. These restrictions prevented procfs from
   being mounted inside rootless containers since almost all container
   implementations overmount parts of procfs to hide certain
   directories.

   As part of this /proc/self/net is only shown in subset=pid mounts for
   CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the
   SB_I_USERNS_VISIBLE superblock flag is replaced with an
   FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are
   recorded in a list, and the mount restrictions are finally
   documented.

 - Protect ptrace_may_access() with exec_update_lock in procfs

   Most uses of ptrace_may_access() in procfs should hold
   exec_update_lock to avoid TOCTOU issues with concurrent privileged
   execve() (like setuid binary execution).

   This fixes the easy cases - the owner and visibility checks and the
   FD link permission checks - with the gnarlier ones to follow later.

* tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: fix ups and tidy ups to /proc/filesystems caching
  proc: protect ptrace_may_access() with exec_update_lock (FD links)
  proc: protect ptrace_may_access() with exec_update_lock (part 1)
  docs: proc: add documentation about mount restrictions
  proc: handle subset=pid separately in userns visibility checks
  proc: prevent reconfiguring subset=pid
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  fs: cache the string generated by reading /proc/filesystems
  sysfs: remove trivial sysfs_get_tree() wrapper
  fs: RCU-ify filesystems list
  fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED
  proc: allow to mark /proc files permanent outside of fs/proc/
  namespace: record fully visible mounts in list

Merge tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T22:29:45+00:00

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Reduce pipe->mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe->mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size > PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 << n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to ->poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...

Merge tag 'vfs-7.2-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T22:00:45+00:00

Pull vfs writeback updates from Christian Brauner:

 - Fix a race between cgroup_writeback_umount() and inode_switch_wbs()

   When a container exits, a race between cgroup_writeback_umount() and
   inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy
   inodes after unmount" followed by a use-after-free on percpu
   counters.

   There is a window between inode_prepare_wbs_switch() returning true
   (having passed the SB_ACTIVE check and grabbed the inode) and the
   subsequent wb_queue_isw() call: if cgroup_writeback_umount() observes
   the global isw_nr_in_flight counter as non-zero but flush_workqueue()
   finds nothing queued yet, it returns early - leaving a held inode
   reference that blocks evict_inodes() and a later iput() that hits
   freed percpu counters.

   The race is closed by covering the window from
   inode_prepare_wbs_switch() through wb_queue_isw() with an RCU
   read-side critical section and synchronizing in the umount path.

   On top of that the now-dead rcu_barrier() left over from the
   queue_rcu_work() era is removed, and the global
   synchronize_rcu()/flush_workqueue() pair is replaced with a per-sb
   in-flight counter plus pin/unpin/drain helpers so umount no longer
   serializes against switch activity on unrelated superblocks.

   Under cgroup writeback churn on a 16 vCPU guest this takes umount
   latency from ~92-138ms p50 down to ~5-8ms p50 and the cumulative cost
   of cgroup_writeback_umount() from ~62ms to ~4us per call.

   The initial race fix is kept separate and minimal so it backports
   cleanly to stable trees that still queue switches via
   queue_rcu_work().

 - Improve write performance with RWF_DONTCACHE

   Dirty DONTCACHE pages are now tracked per bdi_writeback so that the
   writeback flusher can be kicked in a targeted fashion for
   IOCB_DONTCACHE writes instead of relying on global writeback, and the
   PG_dropbehind flag is preserved when a folio is split.

* tag 'vfs-7.2-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
  mm: track DONTCACHE dirty pages per bdi_writeback
  mm: preserve PG_dropbehind flag during folio split
  writeback: use a per-sb counter to drain inode wb switches at umount
  writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount()
  writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()

Merge tag 'vfs-7.2-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T21:55:36+00:00

Pull vfs superblock updates from Christian Brauner:
 "This retires sget().

  CIFS plus the two ext4 KUnit tests (extents-test, mballoc-test) were
  the last in-tree callers, and all three convert cleanly to sget_fc().

  That lets sget() and its prototype come out, taking ~60 lines that
  only existed to be kept in lockstep with sget_fc() on every
  publish-path change"

* tag 'vfs-7.2-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: retire sget()
  smb: client: convert cifs_smb3_do_mount() to sget_fc()
  ext4: convert mballoc KUnit test to sget_fc()
  ext4: convert extents KUnit test to sget_fc()

vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS

2026-06-09T15:17:29+00:00

Commit e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems
without FS_USERNS_MOUNT") prevents the mount of any filesystem inside a
container that doesn't have FS_USERNS_MOUNT set.

This broke NFS mounts in our containerized environment. We have a daemon
somewhat like systemd-mountfsd running in the init_ns. A process does a
fsopen() inside the container and passes it to the daemon via unix
socket.

The daemon then vets that the request is for an allowed NFS server and
performs the mount. This now fails because the fc->user_ns is set to the
value in the container and NFS doesn't set FS_USERNS_MOUNT.  We don't
want to add FS_USERNS_MOUNT to NFS since that would allow the container
to mount any NFS server (even malicious ones).

Add a new FS_USERNS_DELEGATABLE flag, and enable it on NFS.

Fixes: e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT")
Signed-off-by: Jeff Layton 
Link: https://patch.msgid.link/20260129-twmount-v1-1-4874ed2a15c4@kernel.org
Acked-by: Anna Schumaker 
Reviewed-by: Alexander Mikhalitsyn 
Reviewed-by: Jeff Layton 
Signed-off-by: Christian Brauner (Amutable)

mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking

2026-06-04T08:16:51+00:00

The IOCB_DONTCACHE writeback path in generic_write_sync() calls
filemap_flush_range() on every write, submitting writeback inline in
the writer's context.  Perf lock contention profiling shows the
performance problem is not lock contention but the writeback submission
work itself — walking the page tree and submitting I/O blocks the writer
for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
(dontcache).

Replace the inline filemap_flush_range() call with a flusher kick that
drains dirty pages in the background.  This moves writeback submission
completely off the writer's hot path.

To avoid flushing unrelated buffered dirty data, add a dedicated
WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to
write back.  The flusher writes back that many pages from the oldest dirty
inodes (not restricted to dontcache-specific inodes). This helps
preserve I/O batching while limiting the scope of expedited writeback.

Like WB_start_all, the WB_start_dontcache bit coalesces multiple
DONTCACHE writes into a single flusher wakeup without per-write
allocations.  Use test_and_clear_bit to atomically consume the kick
request before reading the dirty counter and starting writeback, so that
concurrent DONTCACHE writes during writeback can re-set the bit and
schedule a follow-up flusher run.

Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches)
rather than wb_stat() (which reads only the global counter) to ensure
small writes below the percpu batch threshold are visible to the flusher.

In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit
inside the unlocked_inode_to_wb_begin/end section for correct cgroup
writeback domain targeting, but defer the wb_wakeup() call until after
the section ends, since wb_wakeup() uses spin_unlock_irq() which would
unconditionally re-enable interrupts while the i_pages xa_lock may still
be held under irqsave during a cgroup writeback switch. Pin the wb with
wb_get() inside the RCU critical section before calling wb_wakeup()
outside it, since cgroup bdi_writeback structures are RCU-freed and the
wb pointer could become invalid after unlocked_inode_to_wb_end() drops
the RCU read lock.

Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
visibility.

dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM,
xfs on NVMe, fio io_uring):

Buffered and direct I/O paths are unaffected by this patchset. All
improvements are confined to the dontcache path:

Single-stream throughput (MB/s):
                        Before    After    Change
  seq-write/dontcache      298      897    +201%
  rand-write/dontcache     131      236     +80%

Tail latency improvements (seq-write/dontcache):
  p99:    135,266 us  ->  23,986 us   (-82%)
  p99.9: 8,925,479 us ->  28,443 us   (-99.7%)

Multi-writer (4 jobs, sequential write):
                                Before    After    Change
  dontcache aggregate (MB/s)     2,529    4,532     +79%
  dontcache p99 (us)             8,553    1,002     -88%
  dontcache p99.9 (us)         109,314    1,057     -99%

  Dontcache multi-writer throughput now matches buffered (4,532 vs
  4,616 MB/s).

32-file write (Axboe test):
                                Before    After    Change
  dontcache aggregate (MB/s)     1,548    3,499    +126%
  dontcache p99 (us)            10,170      602     -94%
  Peak dirty pages (MB)          1,837      213     -88%

  Dontcache now reaches 81% of buffered throughput (was 35%).

Competing writers (dontcache vs buffered, separate files):
                                Before    After
  buffered writer                  868      433 MB/s
  dontcache writer                 415      433 MB/s
  Aggregate                      1,284      866 MB/s

  Previously the buffered writer starved the dontcache writer 2:1.
  With per-bdi_writeback tracking, both writers now receive equal
  bandwidth. The aggregate matches the buffered-vs-buffered baseline
  (863 MB/s), indicating fair sharing regardless of I/O mode.

  The dontcache writer's p99.9 latency collapsed from 119 ms to
  33 ms (-73%), eliminating the severe periodic stalls seen in the
  baseline. Both writers now share identical latency profiles,
  matching the buffered-vs-buffered pattern.

The per-bdi_writeback dirty tracking dramatically reduces peak dirty
pages in dontcache workloads, with the 32-file test dropping from
1.8 GB to 213 MB. Dontcache sequential write throughput triples and
multi-writer throughput reaches parity with buffered I/O, with tail
latencies collapsing by 1-2 orders of magnitude.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Jeff Layton 
Link: https://patch.msgid.link/20260511-dontcache-v7-3-2848ddce8090@kernel.org
Reviewed-by: Jan Kara 
Reviewed-by: Ritesh Harjani (IBM) 
Signed-off-by: Christian Brauner (Amutable)

fs: retire sget()

2026-06-03T07:09:51+00:00

sget() and sget_fc() have lived side by side as near-duplicate
find-or-create-and-publish helpers for the legacy and fs_context mount
APIs. The three remaining in-tree callers (CIFS plus the ext4 extents
and mballoc KUnit tests) have all been moved to sget_fc(). Nothing
calls sget() anymore.

Delete sget() from fs/super.c and the prototype in .
Update the two comments that referred to "sget()" or "sget{_fc}()" to
just say "sget_fc()".

This removes ~60 lines of code that only existed to be kept in
lockstep with sget_fc() on every superblock publish-path change.

Link: https://patch.msgid.link/20260529-work-sget-v2-4-57bbe08604e4@kernel.org
Reviewed-by: Jan Kara 
Signed-off-by: Christian Brauner (Amutable)

fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED

2026-05-11T21:13:01+00:00

Whether a filesystem's mounts need to undergo a visibility check in user
namespaces is a static property of the filesystem type, not a runtime
property of each superblock instance. Both proc and sysfs always set
SB_I_USERNS_VISIBLE on their superblocks unconditionally (sysfs does so
on first creation, and subsequent mounts reuse the same superblock).

Move this flag from sb->s_iflags (SB_I_USERNS_VISIBLE) to
file_system_type->fs_flags (FS_USERNS_MOUNT_RESTRICTED) so the intent
is expressed at the filesystem type level where it belongs.

All check sites are updated to test sb->s_type->fs_flags instead of
sb->s_iflags. The SB_I_NOEXEC and SB_I_NODEV flags remain on the
superblock as they are runtime properties set during fill_super.

Link: https://patch.msgid.link/72887c5b6204dc3adf5a53104f0be6bd8bc4f6cd.1777278334.git.legion@kernel.org
Reviewed-by: Aleksa Sarai 
Signed-off-by: Christian Brauner