linux.git/fs/exec.c, branch v7.2-rc1

Merge tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2026-06-14T22:29:45+00:00

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Reduce pipe->mutex contention by pre-allocating pages outside the
     lock in anon_pipe_write().

     anon_pipe_write() called alloc_page() once per page while holding
     pipe->mutex. The allocation can sleep doing direct reclaim and runs
     memcg charging, which extends the critical section and stalls any
     concurrent reader on the same mutex. Now up to 8 pages are
     pre-allocated before the mutex is taken, leftovers are recycled
     into the per-pipe tmp_page[] cache before unlock, and any remainder
     is released after unlock, keeping the allocator out of the critical
     section on both sides. On a writers x readers sweep with 64KB
     writes against a 1 MB pipe throughput improves 6-28% and average
     write latency drops 5-22%; under memory pressure - when the cost of
     holding the mutex across reclaim is highest - throughput improves
     21-48% and latency drops 17-33%. The microbenchmark is added to
     selftests.

   - uaccess/sockptr: fix the ignored_trailing logic in
     copy_struct_to_user() to behave as documented and the usize check
     in copy_struct_from_sockptr() for user pointers, and add
     copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
     helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).

   - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
     inode backing a dentry via d_real_inode(). On overlayfs the inode
     attached to the dentry doesn't carry the underlying device
     information; this is used by the filesystem restriction BPF program
     that was merged into systemd.

   - docs: add guidelines for submitting new filesystems, motivated by
     the maintenance burden abandoned and untestable filesystems impose
     on VFS developers, blocking infrastructure work like folio
     conversions and iomap migration.

  Fixes:

   - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
     and drop the now-redundant assignments in callers. This began as a
     one-line dma-buf fix for a path_noexec() warning; a pseudo
     filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
     callers were audited: the only visible effect is on dma-buf where
     SB_I_NOEXEC silences the warning.

   - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
     qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
     device with a sector size > PAGE_SIZE crashed roughly half of them;
     the rest had the same missing error handling pattern. Plus a
     follow-up releasing the superblock buffer_head when setting the
     minix v3 block size fails.

   - mount: honour SB_NOUSER in the new mount API.

   - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
     switching the process-group paths of send_sigio() and send_sigurg()
     from read_lock(&tasklist_lock) to RCU, matching the single-PID
     path.

   - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
     delegated NFS mounts (fsopen() in a container with the mount
     performed by a privileged daemon) that broke when non-init
     s_user_ns was tied to FS_USERNS_MOUNT.

   - selftests/namespaces: fix a hang in nsid_test where an unreaped
     grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
     listns_efault_test, and a false FAIL on kernels without listns()
     where the tests should SKIP.

   - filelock: fix the break_lease() stub signature for
     CONFIG_FILE_LOCKING=n.

   - init/initramfs_test: wait for the async initramfs unpacking before
     running; the test and do_populate_rootfs() share the parser state.

   - fs/coredump: reduce redundant log noise in
     validate_coredump_safety().

   - iomap: pass the correct length to fserror_report_io() in
     __iomap_write_begin().

   - backing-file: fix the backing_file_open() kerneldoc.

  Cleanups:

   - initramfs: refactor the cpio hex header parsing to use hex2bin()
     instead of the hand-rolled simple_strntoul() which is reverted, and
     extend the initramfs KUnit tests to cover header fields with 0x
     prefixes.

   - Replace __get_free_pages() and friends with kmalloc()/kzalloc()
     across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
     isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
     do_mounts init code - part of the larger work of replacing page
     allocator calls with kmalloc().

   - Use clear_and_wake_up_bit() in unlock_buffer() and
     journal_end_buffer_io_sync() instead of open-coding the sequence.

   - Drop unused VFS exports: unexport drop_super_exclusive(), remove
     start_removing_user_path_at(), and fold __start_removing_path()
     into start_removing_path().

   - fs/read_write: narrow the __kernel_write() export with
     EXPORT_SYMBOL_FOR_MODULES().

   - vfs: uapi: retire octal and hex constants in favor of (1 << n) for
     the O_ flags. Finding a free bit for a new flag across the
     architectures was needlessly hard with the mixed bases.

   - dcache: add extra sanity checks of dead dentries in dentry_free()
     via a new DENTRY_WARN_ONCE() that also prints d_flags.

   - iov_iter: use kmemdup_array() in dup_iter() to harden the
     allocation against multiplication overflow.

   - fs/pipe: write to ->poll_usage only once.

   - vfs: remove an always-taken if-branch in find_next_fd().

   - dcache: use kmalloc_flex() for struct external_name in __d_alloc().

   - namei: use QSTR() instead of QSTR_INIT() in path_pts().

   - sync_file_range: delete dead S_ISLNK code.

   - Comment fixes: retire a stale comment in fget_task_next() and fix
     assorted spelling mistakes"

* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
  backing-file: fix backing_file_open() kerneldoc parameter
  iomap: pass the correct len to fserror_report_io in __iomap_write_begin
  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
  filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
  vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
  bpf: add bpf_real_inode() kfunc
  fs/read_write: Do not export __kernel_write() to the entire world
  libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
  libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
  mount: honour SB_NOUSER in the new mount API
  fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
  selftests/pipe: add pipe_bench microbenchmark
  fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
  fs: retire stale comment in fget_task_next()
  fs: fix spelling mistakes in comment
  bfs: replace get_zeroed_page() with kzalloc()
  binfmt_misc: replace __get_free_page() with kmalloc()
  configfs: replace __get_free_pages() with kzalloc()
  fs/namespace: use __getname() to allocate mntpath buffer
  fs/select: replace __get_free_page() with kmalloc()
  ...

exec: free the old mm outside the exec locks

2026-05-26T09:02:02+00:00

exec_mmap() installs the new mm and then tears the old one down while
still holding exec_update_lock for writing -- and with cred_guard_mutex
held all the way to setup_new_exec():

	setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
	mm_update_next_owner(old_mm);
	mmput(old_mm);

Neither lock is needed for this. exec_update_lock only exists to make the
mm swap atomic with the later commit_creds(), so that permission-checking
readers (proc, ptrace, the futex robust list, perf, kcmp, mm_access())
never observe the new mm together with the old credentials. Those readers
all operate on task->mm, i.e. the new mm after the swap; none looks at the
detached old mm, its ->owner or signal->maxrss. cred_guard_mutex guards
credential calculation and is equally irrelevant here.

The cost is real: __mmput() runs exit_mmap() over the entire old address
space and can block in exit_aio() waiting for in-flight AIO, all while
holding exec_update_lock for writing and cred_guard_mutex. For execve() of
a large process this blocks ptrace_attach() and every exec_update_lock
reader for the duration of the teardown.

Stash the old mm in bprm->old_mm and release it from setup_new_exec()
after both locks are dropped. setup_new_exec() still runs before
setup_arg_pages() and the segment mappings, so the old address space is
freed before the new one is populated and peak memory is unchanged. The
ordering constraints are kept: old_mm's mmap_lock is still dropped in
exec_mmap() before mm_update_next_owner() (required since commit
31a78f23bac0 ("mm owner: fix race between swapoff and exit")), and
mm_update_next_owner() still precedes mmput(); both run in the execing
task's context, as mm_update_next_owner() requires.

If exec swaps the mm but fails before setup_new_exec() runs the old mm
would leak, so add a backstop in free_bprm(). The lazy-tlb case
(old_mm == NULL, e.g. kernel_execve()) has no address space to
free and is left in exec_mmap().

Link: https://patch.msgid.link/20260522-work-exit_mm-v1-1-bd32d5a560bb@kernel.org
Signed-off-by: Christian Brauner (Amutable)

exec_state: relocate dumpable information

2026-05-26T09:02:01+00:00

The dumpable flag captured at execve() is consulted by
__ptrace_may_access() and several /proc owner / visibility checks.
It lives on mm_struct today, which exit_mm() clears from the task
long before the task itself is reaped.

exec_state is anchored to the execve() that established the current
privilege domain.  CLONE_VM siblings refcount-share the parent's
exec_state via copy_exec_state(); non-CLONE_VM clones allocate a
fresh exec_state inheriting the parent's dumpable mode and user_ns
reference via task_exec_state_copy().  execve() allocates a fresh
instance (via alloc_task_exec_state() in begin_new_exec()) and
installs it under task_lock + exec_update_lock with
task_exec_state_replace().  init_task uses a static instance.

The dumpable mode now lives on task->exec_state->dumpable.
task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is
removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit
positions remain stable for the /proc//coredump_filter ABI. The
task->user_dumpable cache bit and its assignment in exit_mm() are
removed; readers go through get_dumpable(task) directly.

coredump_params gains a snapshot field cprm.dumpable, populated from
get_dumpable(current) at vfs_coredump() entry, replacing the previous
__get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and
fs/pidfs.c.

The user namespace recorded at execve() is consulted by
__ptrace_may_access() and by /proc/PID/* owner derivation. Move the
captured user_ns onto task_exec_state, which stays attached to the task
past exit_mm() and across exit_files().

bprm grows a user_ns field staged in bprm_mm_init() with the caller's
user_ns, narrowed by would_dump() to the closest privileged ancestor,
and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns).
free_bprm() releases the staging reference.

mm_struct loses ->user_ns entirely.  Initializers in init-mm, efi_mm,
and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed;
__mmdrop() drops the matching put_user_ns(). The kthread_use_mm()
WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too.

Reviewed-by: Jann Horn 
Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-4-69f895bc1385@kernel.org
Signed-off-by: Christian Brauner (Amutable)

sched/coredump: introduce enum task_dumpable

2026-05-26T09:02:01+00:00

Replace the SUID_DUMP_DISABLE/USER/ROOT preprocessor constants with
enum task_dumpable.  Numeric values are preserved (kernel.suid_dumpable
sysctl and prctl(PR_SET_DUMPABLE) ABI), so this is a pure rename with
no behavioral change.

Subsequent commits relocate dumpability onto a per-task structure
where the enum type will allow stronger type-checking on the new API.

Reviewed-by: Jann Horn 
Reviewed-by: David Hildenbrand (arm) 
Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-1-69f895bc1385@kernel.org
Signed-off-by: Christian Brauner (Amutable)

fs/coredump: reduce redundant log noise in validate_coredump_safety

2026-05-21T11:39:33+00:00

Currently, writing to 'core_pattern' or 'suid_dumpable' sysctl nodes
always triggers validate_coredump_safety(), even if the values have
not changed. This results in redundant warning messages in dmesg:

"Unsafe core_pattern used with fs.suid_dumpable=2..."

This patch optimizes the procfs handlers to only invoke the safety
validation when an actual change in the configuration is detected:

1. In proc_dostring_coredump(), compare the new core_pattern string
   with the existing one using strncmp().
2. In proc_dointvec_minmax_coredump(), check if the new suid_dumpable
   value differs from the previous one.

This keeps the kernel log clean from repetitive warnings when
re-applying the same sysctl settings.

Signed-off-by: Li RongQing 
Link: https://patch.msgid.link/20260410080918.2319-1-lirongqing@baidu.com
Signed-off-by: Christian Brauner (Amutable)

exec: use strnlen() in __set_task_comm

2026-04-01T19:26:07+00:00

Use strnlen() to limit source string scanning to 'TASK_COMM_LEN - 1'
bytes.

Signed-off-by: Thorsten Blum 
Reviewed-by: Jan Kara 
Link: https://patch.msgid.link/20260401152039.724811-3-thorsten.blum@linux.dev
Signed-off-by: Kees Cook

Convert 'alloc_obj' family to use the new default GFP_KERNEL argument

2026-02-22T01:09:51+00:00

This was done entirely with mindless brute force, using

    git grep -l '\

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

2026-02-21T09:02:28+00:00

This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook

Merge tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2026-02-10T00:58:28+00:00

Pull vfs 'struct filename' updates from Al Viro:
 "[Mostly] sanitize struct filename handling"

* tag 'pull-filename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (68 commits)
  sysfs(2): fs_index() argument is _not_ a pathname
  alpha: switch osf_mount() to strndup_user()
  ksmbd: use CLASS(filename_kernel)
  mqueue: switch to CLASS(filename)
  user_statfs(): switch to CLASS(filename)
  statx: switch to CLASS(filename_maybe_null)
  quotactl_block(): switch to CLASS(filename)
  chroot(2): switch to CLASS(filename)
  move_mount(2): switch to CLASS(filename_maybe_null)
  namei.c: switch user pathname imports to CLASS(filename{,_flags})
  namei.c: convert getname_kernel() callers to CLASS(filename_kernel)
  do_f{chmod,chown,access}at(): use CLASS(filename_uflags)
  do_readlinkat(): switch to CLASS(filename_flags)
  do_sys_truncate(): switch to CLASS(filename)
  do_utimes_path(): switch to CLASS(filename_uflags)
  chdir(2): unspaghettify a bit...
  do_fchownat(): unspaghettify a bit...
  fspick(2): use CLASS(filename_flags)
  name_to_handle_at(): use CLASS(filename_uflags)
  vfs_open_tree(): use CLASS(filename_uflags)
  ...

do_open_execat(): don't care about LOOKUP_EMPTY

2026-01-16T17:52:03+00:00

do_file_open() doesn't.

Signed-off-by: Al Viro