linux-stable.git/include/linux/binfmts.h, branch v5.4.299

coredump: hand a pidfd to the usermode coredump helper

2025-06-04T12:32:36+00:00

commit b5325b2a270fcaf7b2a9a0f23d422ca8a5a8bdea upstream.

Give userspace a way to instruct the kernel to install a pidfd into the
usermode helper process. This makes coredump handling a lot more
reliable for userspace. In parallel with this commit we already have
systemd adding support for this in [1].

We create a pidfs file for the coredumping process when we process the
corename pattern. When the usermode helper process is forked we then
install the pidfs file as file descriptor three into the usermode
helpers file descriptor table so it's available to the exec'd program.

Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is empty
and can thus always use three as the file descriptor number.

Note, that we'll install a pidfd for the thread-group leader even if a
subthread is calling do_coredump(). We know that task linkage hasn't
been removed due to delay_group_leader() and even if this @current isn't
the actual thread-group leader we know that the thread-group leader
cannot be reaped until @current has exited.

[brauner: This is a backport for the v5.4 series. Upstream has
significantly changed and backporting all that infra is a non-starter.
So simply backport the pidfd_prepare() helper and waste the file
descriptor we allocated. Then we minimally massage the umh coredump
setup code.]

Link: https://github.com/systemd/systemd/pull/37125 [1]
Link: https://lore.kernel.org/20250414-work-coredump-v2-3-685bf231f828@kernel.org
Tested-by: Luca Boccassi 
Reviewed-by: Oleg Nesterov 
Signed-off-by: Christian Brauner 
Signed-off-by: Greg Kroah-Hartman

exec: Add exec_update_mutex to replace cred_guard_mutex

2020-10-01T11:17:47+00:00

[ Upstream commit eea9673250db4e854e9998ef9da6d4584857f0ea ]

The cred_guard_mutex is problematic as it is held over possibly
indefinite waits for userspace. The possible indefinite waits for
userspace that I have identified are: The cred_guard_mutex is held in
PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
cred_guard_mutex is held over "get_user(futex_offset, ...") in
exit_robust_list. The cred_guard_mutex held over copy_strings.

The functions get_user and put_user can trigger a page fault which can
potentially wait indefinitely in the case of userfaultfd or if
userspace implements part of the page fault path.

In any of those cases the userspace process that the kernel is waiting
for might make a different system call that winds up taking the
cred_guard_mutex and result in deadlock.

Holding a mutex over any of those possibly indefinite waits for
userspace does not appear necessary. Add exec_update_mutex that will
just cover updating the process during exec where the permissions and
the objects pointed to by the task struct may be out of sync.

The plan is to switch the users of cred_guard_mutex to
exec_update_mutex one by one. This lets us move forward while still
being careful and not introducing any regressions.

Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Reviewed-by: Kirill Tkhai
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Bernd Edlinger
Signed-off-by: Eric W. Biederman
Signed-off-by: Sasha Levin

exec: move struct linux_binprm::buf

2019-05-15T02:52:50+00:00

struct linux_binprm::buf is the first field and it is exactly 128 bytes
in size.  It means that on x86_64 all accesses to other fields will go
though [r64 + disp32] addressing mode which is 3 bytes bloatier than
[r64 + disp8] addressing mode.  Given that accesses to other fields
outnumber accesses to ->buf, move it down.

Space savings (x86_64 defconfig):
more on distro configs because LSMs actively dereference "bprm"
but do not care about first 128 bytes of the executable itself.

add/remove: 0/0 grow/shrink: 0/24 up/down: 0/-492 (-492)
Function                                     old     new   delta
selinux_bprm_committing_creds                552     549      -3
finalize_exec                                 94      91      -3
__audit_log_bprm_fcaps                       283     280      -3
__audit_bprm                                  39      36      -3
perf_trace_sched_process_exec                347     341      -6
install_exec_creds                           105      99      -6
cap_bprm_set_creds.cold                       60      54      -6
would_dump                                   137     128      -9
load_script                                  637     628      -9
bprm_change_interp                            61      52      -9
trace_event_raw_event_sched_process_exec     260     250     -10
search_binary_handler                        255     240     -15
remove_arg_zero                              295     277     -18
free_bprm                                    119     101     -18
prepare_binprm                               379     360     -19
setup_new_exec                               336     315     -21
flush_old_exec                              1638    1617     -21
copy_strings.isra                            746     724     -22
setup_arg_pages                              559     530     -29
load_misc_binary                            1151    1118     -33
selinux_bprm_set_creds                       792     753     -39
load_elf_binary                            11111   11072     -39
cap_bprm_set_creds                          1496    1454     -42
__do_execve_file.isra                       2395    2286    -109

Link: http://lkml.kernel.org/r/20190421165025.GA26843@avx2
Signed-off-by: Alexey Dobriyan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

2019-01-05T21:18:59+00:00

Pull trivial vfs updates from Al Viro:
 "A few cleanups + Neil's namespace_unlock() optimization"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  exec: make prepare_bprm_creds static
  genheaders: %-s had been there since v6; %-*s - since v7
  VFS: use synchronize_rcu_expedited() in namespace_unlock()
  iov_iter: reduce code duplication

exec: separate MM_ANONPAGES and RLIMIT_STACK accounting

2019-01-04T21:13:47+00:00

get_arg_page() checks bprm->rlim_stack.rlim_cur and re-calculates the
"extra" size for argv/envp pointers every time, this is a bit ugly and
even not strictly correct: acct_arg_size() must not account this size.

Remove all the rlimit code in get_arg_page().  Instead, add bprm->argmin
calculated once at the start of __do_execve_file() and change
copy_strings to check bprm->p >= bprm->argmin.

The patch adds the new helper, prepare_arg_pages() which initializes
bprm->argc/envc and bprm->argmin.

[oleg@redhat.com: fix !CONFIG_MMU version of get_arg_page()]
  Link: http://lkml.kernel.org/r/20181126122307.GA1660@redhat.com
[akpm@linux-foundation.org: use max_t]
Link: http://lkml.kernel.org/r/20181112160910.GA28440@redhat.com
Signed-off-by: Oleg Nesterov 
Acked-by: Kees Cook 
Tested-by: Guenter Roeck 
Cc: "Eric W. Biederman" 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

exec: make prepare_bprm_creds static

2018-12-10T09:11:06+00:00

prepare_bprm_creds is not used outside exec.c, so there's no reason for it
to have external linkage.

Signed-off-by: Chanho Min 
Signed-off-by: Al Viro

signal: Distinguish between kernel_siginfo and siginfo

2018-10-03T14:47:43+00:00

Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.

The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.

So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
traditional name for the userspace definition.  While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.

The definition of struct kernel_siginfo I have put in include/signal_types.h

A set of buildtime checks has been added to verify the two structures have
the same field offsets.

To make it easy to verify the change kernel_siginfo retains the same
size as siginfo.  The reduction in size comes in a following change.

Signed-off-by: "Eric W. Biederman"

umh: introduce fork_usermode_blob() helper

2018-05-23T17:23:39+00:00

Introduce helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
struct umh_info {
       struct file *pipe_to_umh;
       struct file *pipe_from_umh;
       pid_t pid;
};

that GPLed kernel modules (signed or unsigned) can use it to execute part
of its own data as swappable user mode process.

The kernel will do:
- allocate a unique file in tmpfs
- populate that file with [data, data + len] bytes
- user-mode-helper code will do_execve that file and, before the process
  starts, the kernel will create two unix pipes for bidirectional
  communication between kernel module and umh
- close tmpfs file, effectively deleting it
- the fork_usermode_blob will return zero on success and populate
  'struct umh_info' with two unix pipes and the pid of the user process

As the first step in the development of the bpfilter project
the fork_usermode_blob() helper is introduced to allow user mode code
to be invoked from a kernel module. The idea is that user mode code plus
normal kernel module code are built as part of the kernel build
and installed as traditional kernel module into distro specified location,
such that from a distribution point of view, there is
no difference between regular kernel modules and kernel modules + umh code.
Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
by a kernel module doesn't make it any special from kernel and user space
tooling point of view.

Such approach enables kernel to delegate functionality traditionally done
by the kernel modules into the user space processes (either root or !root) and
reduces security attack surface of the new code. The buggy umh code would crash
the user process, but not the kernel. Another advantage is that umh code
of the kernel module can be debugged and tested out of user space
(e.g. opening the possibility to run clang sanitizers, fuzzers or
user space test suites on the umh code).
In case of the bpfilter project such architecture allows complex control plane
to be done in the user space while bpf based data plane stays in the kernel.

Since umh can crash, can be oom-ed by the kernel, killed by the admin,
the kernel module that uses them (like bpfilter) needs to manage life
time of umh on its own via two unix pipes and the pid of umh.

The exit code of such kernel module should kill the umh it started,
so that rmmod of the kernel module will cleanup the corresponding umh.
Just like if the kernel module does kmalloc() it should kfree() it
in the exit code.

Signed-off-by: Alexei Starovoitov 
Signed-off-by: David S. Miller

exec: pin stack limit during exec

2018-04-11T17:28:37+00:00

Since the stack rlimit is used in multiple places during exec and it can
be changed via other threads (via setrlimit()) or processes (via
prlimit()), the assumption that the value doesn't change cannot be made.
This leads to races with mm layout selection and argument size
calculations.  This changes the exec path to use the rlimit stored in
bprm instead of in current.  Before starting the thread, the bprm stack
rlimit is stored back to current.

Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
Fixes: 64701dee4178e ("exec: Use sane stack rlimit under secureexec")
Signed-off-by: Kees Cook 
Reported-by: Ben Hutchings 
Reported-by: Andy Lutomirski 
Reported-by: Brad Spengler 
Acked-by: Michal Hocko 
Cc: Ben Hutchings 
Cc: Greg KH 
Cc: Hugh Dickins 
Cc: "Jason A. Donenfeld" 
Cc: Laura Abbott 
Cc: Oleg Nesterov 
Cc: Rik van Riel 
Cc: Willy Tarreau 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

exec: introduce finalize_exec() before start_thread()

2018-04-11T17:28:37+00:00

Provide a final callback into fs/exec.c before start_thread() takes
over, to handle any last-minute changes, like the coming restoration of
the stack limit.

Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook 
Cc: Andy Lutomirski 
Cc: Ben Hutchings 
Cc: Ben Hutchings 
Cc: Brad Spengler 
Cc: Greg KH 
Cc: Hugh Dickins 
Cc: "Jason A. Donenfeld" 
Cc: Laura Abbott 
Cc: Michal Hocko 
Cc: Oleg Nesterov 
Cc: Rik van Riel 
Cc: Willy Tarreau 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds