summaryrefslogtreecommitdiff
path: root/include/uapi/linux
diff options
context:
space:
mode:
authorChristian Brauner <brauner@kernel.org>2026-03-11 23:24:31 +0100
committerChristian Brauner <brauner@kernel.org>2026-03-11 23:24:31 +0100
commita2900f5aa4e7724a2bf56ddb1a4f816d4fcc0598 (patch)
tree2b7fd1e08c4e2a055203a2921f2c7a2348a3ffbb /include/uapi/linux
parent6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f (diff)
parentec26879e6d89983b31fdb27d149854f42ee8d689 (diff)
Merge patch series "pidfd: add CLONE_AUTOREAP, CLONE_NNP, and CLONE_PIDFD_AUTOKILL"
Christian Brauner <brauner@kernel.org> says: Add three new clone3() flags for pidfd-based process lifecycle management. === CLONE_AUTOREAP === CLONE_AUTOREAP makes a child process auto-reap on exit without ever becoming a zombie. This is a per-process property in contrast to the existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for SIGCHLD which applies to all children of a given parent. Currently the only way to automatically reap children is to set SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property affecting all children which makes it unsuitable for libraries or applications that need selective auto-reaping of specific children while still being able to wait() on others. CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct. When the child exits do_notify_parent() checks this flag and causes exit_notify() to transition the task directly to EXIT_DEAD. Since the flag lives on the child it survives reparenting: if the original parent exits and the child is reparented to a subreaper or init the child still auto-reaps when it eventually exits. This is cleaner than forcing the subreaper to get SIGCHLD and then reaping it. If the parent doesn't care the subreaper won't care. If there's a subreaper that would care it would be easy enough to add a prctl() that either just turns back on SIGCHLD and turns off auto-reaping or a prctl() that just notifies the subreaper whenever a child is reparented to it. CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to monitor the child's exit via poll() and retrieve exit status via PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget pattern. No exit signal is delivered so exit_signal must be zero. CLONE_THREAD and CLONE_PARENT are rejected: CLONE_THREAD because autoreap is a process-level property, and CLONE_PARENT because an autoreap child reparented via CLONE_PARENT could become an invisible zombie under a parent that never calls wait(). The flag is not inherited by the autoreap process's own children. Each child that should be autoreaped must be explicitly created with CLONE_AUTOREAP. === CLONE_NNP === CLONE_NNP sets no_new_privs on the child at clone time. Unlike prctl(PR_SET_NO_NEW_PRIVS) which a process sets on itself, CLONE_NNP allows the parent to impose no_new_privs on the child at creation without affecting the parent's own privileges. CLONE_THREAD is rejected because threads share credentials. CLONE_NNP is useful on its own for any spawn-and-sandbox pattern but was specifically introduced to enable unprivileged usage of CLONE_PIDFD_AUTOKILL. === CLONE_PIDFD_AUTOKILL === This flag ties a child's lifetime to the pidfd returned from clone3(). When the last reference to the struct file created by clone3() is closed the kernel sends SIGKILL to the child. A pidfd obtained via pidfd_open() for the same process does not keep the child alive and does not trigger autokill - only the specific struct file from clone3() has this property. This is useful for container runtimes, service managers, and sandboxed subprocess execution - any scenario where the child must die if the parent crashes or abandons the pidfd or just wants a throwaway helper process. CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP. It requires CLONE_PIDFD because the whole point is tying the child's lifetime to the pidfd. It requires CLONE_AUTOREAP because a killed child with no one to reap it would become a zombie - the primary use case is the parent crashing or abandoning the pidfd so no one is around to call waitpid(). CLONE_THREAD is rejected because autokill targets a process not a thread. If CLONE_NNP is specified together with CLONE_PIDFD_AUTOKILL an unprivileged user may spawn a process that is autokilled. The child cannot escalate privileges via setuid/setgid exec after being spawned. If CLONE_PIDFD_AUTOKILL is specified without CLONE_NNP the caller must have have CAP_SYS_ADMIN in its user namespace. * patches from https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org: selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests selftests/pidfd: add CLONE_NNP tests selftests/pidfd: add CLONE_AUTOREAP tests pidfd: add CLONE_PIDFD_AUTOKILL clone: add CLONE_NNP clone: add CLONE_AUTOREAP Link: https://patch.msgid.link/20260226-work-pidfs-autoreap-v5-0-d148b984a989@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
Diffstat (limited to 'include/uapi/linux')
-rw-r--r--include/uapi/linux/pidfd.h1
-rw-r--r--include/uapi/linux/sched.h7
2 files changed, 6 insertions, 2 deletions
diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
index ea9a6811fc76..9281956a9f32 100644
--- a/include/uapi/linux/pidfd.h
+++ b/include/uapi/linux/pidfd.h
@@ -13,6 +13,7 @@
#ifdef __KERNEL__
#include <linux/sched.h>
#define PIDFD_STALE CLONE_PIDFD
+#define PIDFD_AUTOKILL O_TRUNC
#endif
/* Flags for pidfd_send_signal(). */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 359a14cc76a4..149dbc64923b 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -34,8 +34,11 @@
#define CLONE_IO 0x80000000 /* Clone io context */
/* Flags for the clone3() syscall. */
-#define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
-#define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_CLEAR_SIGHAND (1ULL << 32) /* Clear any signal handler and reset to SIG_DFL. */
+#define CLONE_INTO_CGROUP (1ULL << 33) /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_AUTOREAP (1ULL << 34) /* Auto-reap child on exit. */
+#define CLONE_NNP (1ULL << 35) /* Set no_new_privs on child. */
+#define CLONE_PIDFD_AUTOKILL (1ULL << 36) /* Kill child when clone pidfd closes. */
/*
* cloning flags intersect with CSIGNAL so can be used with unshare and clone3