<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/binfmt_misc.c, branch v6.7</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>Merge tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux</title>
<updated>2023-10-31T05:28:19+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-10-31T05:28:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d82c0a37d431ada0d1dae9a2665fcfe17b0f9e14'/>
<id>d82c0a37d431ada0d1dae9a2665fcfe17b0f9e14</id>
<content type='text'>
Pull execve updates from Kees Cook:

 - Support non-BSS ELF segments with zero filesz

   Eric Biederman and I refactored ELF segment loading to handle the
   case where a segment has a smaller filesz than memsz. Traditionally
   linkers only did this for .bss and it was always the last segment. As
   a result, the kernel only handled this case when it was the last
   segment. We've had two recent cases where linkers were trying to use
   these kinds of segments for other reasons, and the were in the middle
   of the segment list. There was no good reason for the kernel not to
   support this, and the refactor actually ends up making things more
   readable too.

 - Enable namespaced binfmt_misc

   Christian Brauner has made it possible to use binfmt_misc with mount
   namespaces. This means some traditionally root-only interfaces (for
   adding/removing formats) are now more exposed (but believed to be
   safe).

 - Remove struct tag 'dynamic' from ELF UAPI

   Alejandro Colomar noticed that the ELF UAPI has been polluting the
   struct namespace with an unused and overly generic tag named
   "dynamic" for no discernible reason for many many years. After
   double-checking various distro source repositories, it has been
   removed.

 - Clean up binfmt_elf_fdpic debug output (Greg Ungerer)

* tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_misc: enable sandboxed mounts
  binfmt_misc: cleanup on filesystem umount
  binfmt_elf_fdpic: clean up debug warnings
  mm: Remove unused vm_brk()
  binfmt_elf: Only report padzero() errors when PROT_WRITE
  binfmt_elf: Use elf_load() for library
  binfmt_elf: Use elf_load() for interpreter
  binfmt_elf: elf_bss no longer used by load_elf_binary()
  binfmt_elf: Support segments with 0 filesz and misaligned starts
  elf, uapi: Remove struct tag 'dynamic'
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull execve updates from Kees Cook:

 - Support non-BSS ELF segments with zero filesz

   Eric Biederman and I refactored ELF segment loading to handle the
   case where a segment has a smaller filesz than memsz. Traditionally
   linkers only did this for .bss and it was always the last segment. As
   a result, the kernel only handled this case when it was the last
   segment. We've had two recent cases where linkers were trying to use
   these kinds of segments for other reasons, and the were in the middle
   of the segment list. There was no good reason for the kernel not to
   support this, and the refactor actually ends up making things more
   readable too.

 - Enable namespaced binfmt_misc

   Christian Brauner has made it possible to use binfmt_misc with mount
   namespaces. This means some traditionally root-only interfaces (for
   adding/removing formats) are now more exposed (but believed to be
   safe).

 - Remove struct tag 'dynamic' from ELF UAPI

   Alejandro Colomar noticed that the ELF UAPI has been polluting the
   struct namespace with an unused and overly generic tag named
   "dynamic" for no discernible reason for many many years. After
   double-checking various distro source repositories, it has been
   removed.

 - Clean up binfmt_elf_fdpic debug output (Greg Ungerer)

* tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt_misc: enable sandboxed mounts
  binfmt_misc: cleanup on filesystem umount
  binfmt_elf_fdpic: clean up debug warnings
  mm: Remove unused vm_brk()
  binfmt_elf: Only report padzero() errors when PROT_WRITE
  binfmt_elf: Use elf_load() for library
  binfmt_elf: Use elf_load() for interpreter
  binfmt_elf: elf_bss no longer used by load_elf_binary()
  binfmt_elf: Support segments with 0 filesz and misaligned starts
  elf, uapi: Remove struct tag 'dynamic'
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: convert core infrastructure to new timestamp accessors</title>
<updated>2023-10-18T11:26:15+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@kernel.org</email>
</author>
<published>2023-10-04T18:52:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=16a9496523a473866a0b794ee6d6c5b74fe67d40'/>
<id>16a9496523a473866a0b794ee6d6c5b74fe67d40</id>
<content type='text'>
Convert the core vfs code to use the new timestamp accessor functions.

Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Link: https://lore.kernel.org/r/20231004185239.80830-2-jlayton@kernel.org
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Convert the core vfs code to use the new timestamp accessor functions.

Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Link: https://lore.kernel.org/r/20231004185239.80830-2-jlayton@kernel.org
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt_misc: enable sandboxed mounts</title>
<updated>2023-10-11T15:46:01+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2021-10-28T10:31:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=21ca59b365c091d583f36ac753eaa8baf947be6f'/>
<id>21ca59b365c091d583f36ac753eaa8baf947be6f</id>
<content type='text'>
Enable unprivileged sandboxes to create their own binfmt_misc mounts.
This is based on Laurent's work in [1] but has been significantly
reworked to fix various issues we identified in earlier versions.

While binfmt_misc can currently only be mounted in the initial user
namespace, binary types registered in this binfmt_misc instance are
available to all sandboxes (Either by having them installed in the
sandbox or by registering the binary type with the F flag causing the
interpreter to be opened right away). So binfmt_misc binary types are
already delegated to sandboxes implicitly.

However, while a sandbox has access to all registered binary types in
binfmt_misc a sandbox cannot currently register its own binary types
in binfmt_misc. This has prevented various use-cases some of which were
already outlined in [1] but we have a range of issues associated with
this (cf. [3]-[5] below which are just a small sample).

Extend binfmt_misc to be mountable in non-initial user namespaces.
Similar to other filesystem such as nfsd, mqueue, and sunrpc we use
keyed superblock management. The key determines whether we need to
create a new superblock or can reuse an already existing one. We use the
user namespace of the mount as key. This means a new binfmt_misc
superblock is created once per user namespace creation. Subsequent
mounts of binfmt_misc in the same user namespace will mount the same
binfmt_misc instance. We explicitly do not create a new binfmt_misc
superblock on every binfmt_misc mount as the semantics for
load_misc_binary() line up with the keying model. This also allows us to
retrieve the relevant binfmt_misc instance based on the caller's user
namespace which can be done in a simple (bounded to 32 levels) loop.

Similar to the current binfmt_misc semantics allowing access to the
binary types in the initial binfmt_misc instance we do allow sandboxes
access to their parent's binfmt_misc mounts if they do not have created
a separate binfmt_misc instance.

Overall, this will unblock the use-cases mentioned below and in general
will also allow to support and harden execution of another
architecture's binaries in tight sandboxes. For instance, using the
unshare binary it possible to start a chroot of another architecture and
configure the binfmt_misc interpreter without being root to run the
binaries in this chroot and without requiring the host to modify its
binary type handlers.

Henning had already posted a few experiments in the cover letter at [1].
But here's an additional example where an unprivileged container
registers qemu-user-static binary handlers for various binary types in
its separate binfmt_misc mount and is then seamlessly able to start
containers with a different architecture without affecting the host:

root    [lxc monitor] /var/snap/lxd/common/lxd/containers f1
1000000  \_ /sbin/init
1000000      \_ /lib/systemd/systemd-journald
1000000      \_ /lib/systemd/systemd-udevd
1000100      \_ /lib/systemd/systemd-networkd
1000101      \_ /lib/systemd/systemd-resolved
1000000      \_ /usr/sbin/cron -f
1000103      \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
1000000      \_ /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1000104      \_ /usr/sbin/rsyslogd -n -iNONE
1000000      \_ /lib/systemd/systemd-logind
1000000      \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1000107      \_ dnsmasq --conf-file=/dev/null -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --liste
1000000      \_ [lxc monitor] /var/lib/lxc f1-s390x
1100000          \_ /usr/bin/qemu-s390x-static /sbin/init
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-journald
1100000              \_ /usr/bin/qemu-s390x-static /usr/sbin/cron -f
1100103              \_ /usr/bin/qemu-s390x-static /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-ac
1100000              \_ /usr/bin/qemu-s390x-static /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1100104              \_ /usr/bin/qemu-s390x-static /usr/sbin/rsyslogd -n -iNONE
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-logind
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/0 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-udevd

[1]: https://lore.kernel.org/all/20191216091220.465626-1-laurent@vivier.eu
[2]: https://discuss.linuxcontainers.org/t/binfmt-misc-permission-denied
[3]: https://discuss.linuxcontainers.org/t/lxd-binfmt-support-for-qemu-static-interpreters
[4]: https://discuss.linuxcontainers.org/t/3-1-0-binfmt-support-service-in-unprivileged-guest-requires-write-access-on-hosts-proc-sys-fs-binfmt-misc
[5]: https://discuss.linuxcontainers.org/t/qemu-user-static-not-working-4-11

Link: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu (origin)
Link: https://lore.kernel.org/r/20211028103114.2849140-2-brauner@kernel.org (v1)
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Serge Hallyn &lt;serge@hallyn.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Henning Schild &lt;henning.schild@siemens.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Laurent Vivier &lt;laurent@vivier.eu&gt;
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Laurent Vivier &lt;laurent@vivier.eu&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
---
/* v2 */
- Serge Hallyn &lt;serge@hallyn.com&gt;:
  - Use GFP_KERNEL_ACCOUNT for userspace triggered allocations when a
    new binary type handler is registered.
- Christian Brauner &lt;christian.brauner@ubuntu.com&gt;:
  - Switch authorship to me. I refused to do that earlier even though
    Laurent said I should do so because I think it's genuinely bad form.
    But by now I have changed so many things that it'd be unfair to
    blame Laurent for any potential bugs in here.
  - Add more comments that explain what's going on.
  - Rename functions while changing them to better reflect what they are
    doing to make the code easier to understand.
  - In the first version when a specific binary type handler was removed
    either through a write to the entry's file or all binary type
    handlers were removed by a write to the binfmt_misc mount's status
    file all cleanup work happened during inode eviction.
    That includes removal of the relevant entries from entry list. While
    that works fine I disliked that model after thinking about it for a
    bit. Because it means that there was a window were someone has
    already removed a or all binary handlers but they could still be
    safely reached from load_misc_binary() when it has managed to take
    the read_lock() on the entries list while inode eviction was already
    happening. Again, that perfectly benign but it's cleaner to remove
    the binary handler from the list immediately meaning that ones the
    write to then entry's file or the binfmt_misc status file returns
    the binary type cannot be executed anymore. That gives stronger
    guarantees to the user.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Enable unprivileged sandboxes to create their own binfmt_misc mounts.
This is based on Laurent's work in [1] but has been significantly
reworked to fix various issues we identified in earlier versions.

While binfmt_misc can currently only be mounted in the initial user
namespace, binary types registered in this binfmt_misc instance are
available to all sandboxes (Either by having them installed in the
sandbox or by registering the binary type with the F flag causing the
interpreter to be opened right away). So binfmt_misc binary types are
already delegated to sandboxes implicitly.

However, while a sandbox has access to all registered binary types in
binfmt_misc a sandbox cannot currently register its own binary types
in binfmt_misc. This has prevented various use-cases some of which were
already outlined in [1] but we have a range of issues associated with
this (cf. [3]-[5] below which are just a small sample).

Extend binfmt_misc to be mountable in non-initial user namespaces.
Similar to other filesystem such as nfsd, mqueue, and sunrpc we use
keyed superblock management. The key determines whether we need to
create a new superblock or can reuse an already existing one. We use the
user namespace of the mount as key. This means a new binfmt_misc
superblock is created once per user namespace creation. Subsequent
mounts of binfmt_misc in the same user namespace will mount the same
binfmt_misc instance. We explicitly do not create a new binfmt_misc
superblock on every binfmt_misc mount as the semantics for
load_misc_binary() line up with the keying model. This also allows us to
retrieve the relevant binfmt_misc instance based on the caller's user
namespace which can be done in a simple (bounded to 32 levels) loop.

Similar to the current binfmt_misc semantics allowing access to the
binary types in the initial binfmt_misc instance we do allow sandboxes
access to their parent's binfmt_misc mounts if they do not have created
a separate binfmt_misc instance.

Overall, this will unblock the use-cases mentioned below and in general
will also allow to support and harden execution of another
architecture's binaries in tight sandboxes. For instance, using the
unshare binary it possible to start a chroot of another architecture and
configure the binfmt_misc interpreter without being root to run the
binaries in this chroot and without requiring the host to modify its
binary type handlers.

Henning had already posted a few experiments in the cover letter at [1].
But here's an additional example where an unprivileged container
registers qemu-user-static binary handlers for various binary types in
its separate binfmt_misc mount and is then seamlessly able to start
containers with a different architecture without affecting the host:

root    [lxc monitor] /var/snap/lxd/common/lxd/containers f1
1000000  \_ /sbin/init
1000000      \_ /lib/systemd/systemd-journald
1000000      \_ /lib/systemd/systemd-udevd
1000100      \_ /lib/systemd/systemd-networkd
1000101      \_ /lib/systemd/systemd-resolved
1000000      \_ /usr/sbin/cron -f
1000103      \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
1000000      \_ /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1000104      \_ /usr/sbin/rsyslogd -n -iNONE
1000000      \_ /lib/systemd/systemd-logind
1000000      \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1000107      \_ dnsmasq --conf-file=/dev/null -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --liste
1000000      \_ [lxc monitor] /var/lib/lxc f1-s390x
1100000          \_ /usr/bin/qemu-s390x-static /sbin/init
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-journald
1100000              \_ /usr/bin/qemu-s390x-static /usr/sbin/cron -f
1100103              \_ /usr/bin/qemu-s390x-static /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-ac
1100000              \_ /usr/bin/qemu-s390x-static /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1100104              \_ /usr/bin/qemu-s390x-static /usr/sbin/rsyslogd -n -iNONE
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-logind
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/0 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-udevd

[1]: https://lore.kernel.org/all/20191216091220.465626-1-laurent@vivier.eu
[2]: https://discuss.linuxcontainers.org/t/binfmt-misc-permission-denied
[3]: https://discuss.linuxcontainers.org/t/lxd-binfmt-support-for-qemu-static-interpreters
[4]: https://discuss.linuxcontainers.org/t/3-1-0-binfmt-support-service-in-unprivileged-guest-requires-write-access-on-hosts-proc-sys-fs-binfmt-misc
[5]: https://discuss.linuxcontainers.org/t/qemu-user-static-not-working-4-11

Link: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu (origin)
Link: https://lore.kernel.org/r/20211028103114.2849140-2-brauner@kernel.org (v1)
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Serge Hallyn &lt;serge@hallyn.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Henning Schild &lt;henning.schild@siemens.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Laurent Vivier &lt;laurent@vivier.eu&gt;
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Laurent Vivier &lt;laurent@vivier.eu&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
---
/* v2 */
- Serge Hallyn &lt;serge@hallyn.com&gt;:
  - Use GFP_KERNEL_ACCOUNT for userspace triggered allocations when a
    new binary type handler is registered.
- Christian Brauner &lt;christian.brauner@ubuntu.com&gt;:
  - Switch authorship to me. I refused to do that earlier even though
    Laurent said I should do so because I think it's genuinely bad form.
    But by now I have changed so many things that it'd be unfair to
    blame Laurent for any potential bugs in here.
  - Add more comments that explain what's going on.
  - Rename functions while changing them to better reflect what they are
    doing to make the code easier to understand.
  - In the first version when a specific binary type handler was removed
    either through a write to the entry's file or all binary type
    handlers were removed by a write to the binfmt_misc mount's status
    file all cleanup work happened during inode eviction.
    That includes removal of the relevant entries from entry list. While
    that works fine I disliked that model after thinking about it for a
    bit. Because it means that there was a window were someone has
    already removed a or all binary handlers but they could still be
    safely reached from load_misc_binary() when it has managed to take
    the read_lock() on the entries list while inode eviction was already
    happening. Again, that perfectly benign but it's cleaner to remove
    the binary handler from the list immediately meaning that ones the
    write to then entry's file or the binfmt_misc status file returns
    the binary type cannot be executed anymore. That gives stronger
    guarantees to the user.
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt_misc: cleanup on filesystem umount</title>
<updated>2023-10-11T15:46:01+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2021-10-28T10:31:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=1c5976ef0f7ad76319df748ccb99a4c7ba2ba464'/>
<id>1c5976ef0f7ad76319df748ccb99a4c7ba2ba464</id>
<content type='text'>
Currently, registering a new binary type pins the binfmt_misc
filesystem. Specifically, this means that as long as there is at least
one binary type registered the binfmt_misc filesystem survives all
umounts, i.e. the superblock is not destroyed. Meaning that a umount
followed by another mount will end up with the same superblock and the
same binary type handlers. This is a behavior we tend to discourage for
any new filesystems (apart from a few special filesystems such as e.g.
configfs or debugfs). A umount operation without the filesystem being
pinned - by e.g. someone holding a file descriptor to an open file -
should usually result in the destruction of the superblock and all
associated resources. This makes introspection easier and leads to
clearly defined, simple and clean semantics. An administrator can rely
on the fact that a umount will guarantee a clean slate making it
possible to reinitialize a filesystem. Right now all binary types would
need to be explicitly deleted before that can happen.

This allows us to remove the heavy-handed calls to simple_pin_fs() and
simple_release_fs() when creating and deleting binary types. This in
turn allows us to replace the current brittle pinning mechanism abusing
dget() which has caused a range of bugs judging from prior fixes in [2]
and [3]. The additional dget() in load_misc_binary() pins the dentry but
only does so for the sake to prevent -&gt;evict_inode() from freeing the
node when a user removes the binary type and kill_node() is run. Which
would mean -&gt;interpreter and -&gt;interp_file would be freed causing a UAF.

This isn't really nicely documented nor is it very clean because it
relies on simple_pin_fs() pinning the filesystem as long as at least one
binary type exists. Otherwise it would cause load_misc_binary() to hold
on to a dentry belonging to a superblock that has been shutdown.
Replace that implicit pinning with a clean and simple per-node refcount
and get rid of the ugly dget() pinning. A similar mechanism exists for
e.g. binderfs (cf. [4]). All the cleanup work can now be done in
-&gt;evict_inode().

In a follow-up patch we will make it possible to use binfmt_misc in
sandboxes. We will use the cleaner semantics where a umount for the
filesystem will cause the superblock and all resources to be
deallocated. In preparation for this apply the same semantics to the
initial binfmt_misc mount. Note, that this is a user-visible change and
as such a uapi change but one that we can reasonably risk. We've
discussed this in earlier versions of this patchset (cf. [1]).

The main user and provider of binfmt_misc is systemd. Systemd provides
binfmt_misc via autofs since it is configurable as a kernel module and
is used by a few exotic packages and users. As such a binfmt_misc mount
is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
provided on demand. Other autofs on demand filesystems include EFI ESP
which systemd umounts if the mountpoint stays idle for a certain amount
of time. This doesn't apply to the binfmt_misc autofs mount which isn't
touched once it is mounted meaning this change can't accidently wipe
binary type handlers without someone having explicitly unmounted
binfmt_misc. After speaking to systemd folks they don't expect this
change to affect them.

In line with our general policy, if we see a regression for systemd or
other users with this change we will switch back to the old behavior for
the initial binfmt_misc mount and have binary types pin the filesystem
again. But while we touch this code let's take the chance and let's
improve on the status quo.

[1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
[2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
[3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
[4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II")

Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1)
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Serge Hallyn &lt;serge@hallyn.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Henning Schild &lt;henning.schild@siemens.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Laurent Vivier &lt;laurent@vivier.eu&gt;
Cc: linux-fsdevel@vger.kernel.org
Acked-by: Serge Hallyn &lt;serge@hallyn.com&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
---
/* v2 */
- Christian Brauner &lt;christian.brauner@ubuntu.com&gt;:
  - Add more comments that explain what's going on.
  - Rename functions while changing them to better reflect what they are
    doing to make the code easier to understand.
  - In the first version when a specific binary type handler was removed
    either through a write to the entry's file or all binary type
    handlers were removed by a write to the binfmt_misc mount's status
    file all cleanup work happened during inode eviction.
    That includes removal of the relevant entries from entry list. While
    that works fine I disliked that model after thinking about it for a
    bit. Because it means that there was a window were someone has
    already removed a or all binary handlers but they could still be
    safely reached from load_misc_binary() when it has managed to take
    the read_lock() on the entries list while inode eviction was already
    happening. Again, that perfectly benign but it's cleaner to remove
    the binary handler from the list immediately meaning that ones the
    write to then entry's file or the binfmt_misc status file returns
    the binary type cannot be executed anymore. That gives stronger
    guarantees to the user.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently, registering a new binary type pins the binfmt_misc
filesystem. Specifically, this means that as long as there is at least
one binary type registered the binfmt_misc filesystem survives all
umounts, i.e. the superblock is not destroyed. Meaning that a umount
followed by another mount will end up with the same superblock and the
same binary type handlers. This is a behavior we tend to discourage for
any new filesystems (apart from a few special filesystems such as e.g.
configfs or debugfs). A umount operation without the filesystem being
pinned - by e.g. someone holding a file descriptor to an open file -
should usually result in the destruction of the superblock and all
associated resources. This makes introspection easier and leads to
clearly defined, simple and clean semantics. An administrator can rely
on the fact that a umount will guarantee a clean slate making it
possible to reinitialize a filesystem. Right now all binary types would
need to be explicitly deleted before that can happen.

This allows us to remove the heavy-handed calls to simple_pin_fs() and
simple_release_fs() when creating and deleting binary types. This in
turn allows us to replace the current brittle pinning mechanism abusing
dget() which has caused a range of bugs judging from prior fixes in [2]
and [3]. The additional dget() in load_misc_binary() pins the dentry but
only does so for the sake to prevent -&gt;evict_inode() from freeing the
node when a user removes the binary type and kill_node() is run. Which
would mean -&gt;interpreter and -&gt;interp_file would be freed causing a UAF.

This isn't really nicely documented nor is it very clean because it
relies on simple_pin_fs() pinning the filesystem as long as at least one
binary type exists. Otherwise it would cause load_misc_binary() to hold
on to a dentry belonging to a superblock that has been shutdown.
Replace that implicit pinning with a clean and simple per-node refcount
and get rid of the ugly dget() pinning. A similar mechanism exists for
e.g. binderfs (cf. [4]). All the cleanup work can now be done in
-&gt;evict_inode().

In a follow-up patch we will make it possible to use binfmt_misc in
sandboxes. We will use the cleaner semantics where a umount for the
filesystem will cause the superblock and all resources to be
deallocated. In preparation for this apply the same semantics to the
initial binfmt_misc mount. Note, that this is a user-visible change and
as such a uapi change but one that we can reasonably risk. We've
discussed this in earlier versions of this patchset (cf. [1]).

The main user and provider of binfmt_misc is systemd. Systemd provides
binfmt_misc via autofs since it is configurable as a kernel module and
is used by a few exotic packages and users. As such a binfmt_misc mount
is triggered when /proc/sys/fs/binfmt_misc is accessed and is only
provided on demand. Other autofs on demand filesystems include EFI ESP
which systemd umounts if the mountpoint stays idle for a certain amount
of time. This doesn't apply to the binfmt_misc autofs mount which isn't
touched once it is mounted meaning this change can't accidently wipe
binary type handlers without someone having explicitly unmounted
binfmt_misc. After speaking to systemd folks they don't expect this
change to affect them.

In line with our general policy, if we see a regression for systemd or
other users with this change we will switch back to the old behavior for
the initial binfmt_misc mount and have binary types pin the filesystem
again. But while we touch this code let's take the chance and let's
improve on the status quo.

[1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu
[2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()"
[3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
[4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II")

Link: https://lore.kernel.org/r/20211028103114.2849140-1-brauner@kernel.org (v1)
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Serge Hallyn &lt;serge@hallyn.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Henning Schild &lt;henning.schild@siemens.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Laurent Vivier &lt;laurent@vivier.eu&gt;
Cc: linux-fsdevel@vger.kernel.org
Acked-by: Serge Hallyn &lt;serge@hallyn.com&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
---
/* v2 */
- Christian Brauner &lt;christian.brauner@ubuntu.com&gt;:
  - Add more comments that explain what's going on.
  - Rename functions while changing them to better reflect what they are
    doing to make the code easier to understand.
  - In the first version when a specific binary type handler was removed
    either through a write to the entry's file or all binary type
    handlers were removed by a write to the binfmt_misc mount's status
    file all cleanup work happened during inode eviction.
    That includes removal of the relevant entries from entry list. While
    that works fine I disliked that model after thinking about it for a
    bit. Because it means that there was a window were someone has
    already removed a or all binary handlers but they could still be
    safely reached from load_misc_binary() when it has managed to take
    the read_lock() on the entries list while inode eviction was already
    happening. Again, that perfectly benign but it's cleaner to remove
    the binary handler from the list immediately meaning that ones the
    write to then entry's file or the binfmt_misc status file returns
    the binary type cannot be executed anymore. That gives stronger
    guarantees to the user.
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: convert to ctime accessor functions</title>
<updated>2023-07-13T08:28:04+00:00</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@kernel.org</email>
</author>
<published>2023-07-05T19:00:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2276e5ba8567f683c49a36ba885d0fe6abe2b45e'/>
<id>2276e5ba8567f683c49a36ba885d0fe6abe2b45e</id>
<content type='text'>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode-&gt;i_ctime.

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Message-Id: &lt;20230705190309.579783-23-jlayton@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode-&gt;i_ctime.

Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jeff Layton &lt;jlayton@kernel.org&gt;
Message-Id: &lt;20230705190309.579783-23-jlayton@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt_misc: fix shift-out-of-bounds in check_special_flags</title>
<updated>2022-12-02T21:57:04+00:00</updated>
<author>
<name>Liu Shixin</name>
<email>liushixin2@huawei.com</email>
</author>
<published>2022-11-02T02:51:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6a46bf558803dd2b959ca7435a5c143efe837217'/>
<id>6a46bf558803dd2b959ca7435a5c143efe837217</id>
<content type='text'>
UBSAN reported a shift-out-of-bounds warning:

 left shift of 1 by 31 places cannot be represented in type 'int'
 Call Trace:
  &lt;TASK&gt;
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
  ubsan_epilogue+0xa/0x44 lib/ubsan.c:151
  __ubsan_handle_shift_out_of_bounds+0x1e7/0x208 lib/ubsan.c:322
  check_special_flags fs/binfmt_misc.c:241 [inline]
  create_entry fs/binfmt_misc.c:456 [inline]
  bm_register_write+0x9d3/0xa20 fs/binfmt_misc.c:654
  vfs_write+0x11e/0x580 fs/read_write.c:582
  ksys_write+0xcf/0x120 fs/read_write.c:637
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x34/0x80 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x4194e1

Since the type of Node's flags is unsigned long, we should define these
macros with same type too.

Signed-off-by: Liu Shixin &lt;liushixin2@huawei.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20221102025123.1117184-1-liushixin2@huawei.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
UBSAN reported a shift-out-of-bounds warning:

 left shift of 1 by 31 places cannot be represented in type 'int'
 Call Trace:
  &lt;TASK&gt;
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
  ubsan_epilogue+0xa/0x44 lib/ubsan.c:151
  __ubsan_handle_shift_out_of_bounds+0x1e7/0x208 lib/ubsan.c:322
  check_special_flags fs/binfmt_misc.c:241 [inline]
  create_entry fs/binfmt_misc.c:456 [inline]
  bm_register_write+0x9d3/0xa20 fs/binfmt_misc.c:654
  vfs_write+0x11e/0x580 fs/read_write.c:582
  ksys_write+0xcf/0x120 fs/read_write.c:637
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x34/0x80 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x4194e1

Since the type of Node's flags is unsigned long, we should define these
macros with same type too.

Signed-off-by: Liu Shixin &lt;liushixin2@huawei.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20221102025123.1117184-1-liushixin2@huawei.com
</pre>
</div>
</content>
</entry>
<entry>
<title>Fix regression due to "fs: move binfmt_misc sysctl to its own file"</title>
<updated>2022-02-09T17:50:02+00:00</updated>
<author>
<name>Domenico Andreoli</name>
<email>domenico.andreoli@linux.com</email>
</author>
<published>2022-02-09T07:49:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b42bc9a3c5115c3102a4923776bbeed3b191f2db'/>
<id>b42bc9a3c5115c3102a4923776bbeed3b191f2db</id>
<content type='text'>
Commit 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file") did
not go unnoticed, binfmt-support stopped to work on my Debian system
since v5.17-rc2 (did not check with -rc1).

The existance of the /proc/sys/fs/binfmt_misc is a precondition for
attempting to mount the binfmt_misc fs, which in turn triggers the
autoload of the binfmt_misc module.  Without it, no module is loaded and
no binfmt is available at boot.

Building as built-in or manually loading the module and mounting the fs
works fine, it's therefore only a matter of interaction with user-space.
I could try to improve the Debian systemd configuration but I can't say
anything about the other distributions.

This patch restores a working system right after boot.

Fixes: 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file")
Signed-off-by: Domenico Andreoli &lt;domenico.andreoli@linux.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Reviewed-by: Tong Zhang &lt;ztong0001@gmail.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file") did
not go unnoticed, binfmt-support stopped to work on my Debian system
since v5.17-rc2 (did not check with -rc1).

The existance of the /proc/sys/fs/binfmt_misc is a precondition for
attempting to mount the binfmt_misc fs, which in turn triggers the
autoload of the binfmt_misc module.  Without it, no module is loaded and
no binfmt is available at boot.

Building as built-in or manually loading the module and mounting the fs
works fine, it's therefore only a matter of interaction with user-space.
I could try to improve the Debian systemd configuration but I can't say
anything about the other distributions.

This patch restores a working system right after boot.

Fixes: 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file")
Signed-off-by: Domenico Andreoli &lt;domenico.andreoli@linux.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Reviewed-by: Tong Zhang &lt;ztong0001@gmail.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt_misc: fix crash when load/unload module</title>
<updated>2022-01-30T07:56:58+00:00</updated>
<author>
<name>Tong Zhang</name>
<email>ztong0001@gmail.com</email>
</author>
<published>2022-01-29T21:40:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e7f1e8834b2b2144dfbe0b2235d05e4f6f95882e'/>
<id>e7f1e8834b2b2144dfbe0b2235d05e4f6f95882e</id>
<content type='text'>
We should unregister the table upon module unload otherwise something
horrible will happen when we load binfmt_misc module again.  Also note
that we should keep value returned by register_sysctl_mount_point() and
release it later, otherwise it will leak.

Also, per Christian's comment, to fully restore the old behavior that
won't break userspace the check(binfmt_misc_header) should be
eliminated.

To reproduce:
  modprobe binfmt_misc
  modprobe -r binfmt_misc
  modprobe binfmt_misc
  modprobe -r binfmt_misc
  modprobe binfmt_misc

resulting in

  modprobe: can't load module binfmt_misc (kernel/fs/binfmt_misc.ko): Cannot allocate memory

and an unhappy kernel:

  binfmt_misc: Failed to create fs/binfmt_misc sysctl mount point
  binfmt_misc: Failed to create fs/binfmt_misc sysctl mount point
  BUG: unable to handle page fault for address: fffffbfff8004802
  Call Trace:
    init_misc_binfmt+0x2d/0x1000 [binfmt_misc]

Link: https://lkml.kernel.org/r/20220124181812.1869535-2-ztong0001@gmail.com
Fixes: 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file")
Signed-off-by: Tong Zhang &lt;ztong0001@gmail.com&gt;
Co-developed-by: Christian Brauner&lt;brauner@kernel.org&gt;
Acked-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Iurii Zaikin &lt;yzaikin@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We should unregister the table upon module unload otherwise something
horrible will happen when we load binfmt_misc module again.  Also note
that we should keep value returned by register_sysctl_mount_point() and
release it later, otherwise it will leak.

Also, per Christian's comment, to fully restore the old behavior that
won't break userspace the check(binfmt_misc_header) should be
eliminated.

To reproduce:
  modprobe binfmt_misc
  modprobe -r binfmt_misc
  modprobe binfmt_misc
  modprobe -r binfmt_misc
  modprobe binfmt_misc

resulting in

  modprobe: can't load module binfmt_misc (kernel/fs/binfmt_misc.ko): Cannot allocate memory

and an unhappy kernel:

  binfmt_misc: Failed to create fs/binfmt_misc sysctl mount point
  binfmt_misc: Failed to create fs/binfmt_misc sysctl mount point
  BUG: unable to handle page fault for address: fffffbfff8004802
  Call Trace:
    init_misc_binfmt+0x2d/0x1000 [binfmt_misc]

Link: https://lkml.kernel.org/r/20220124181812.1869535-2-ztong0001@gmail.com
Fixes: 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file")
Signed-off-by: Tong Zhang &lt;ztong0001@gmail.com&gt;
Co-developed-by: Christian Brauner&lt;brauner@kernel.org&gt;
Acked-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Iurii Zaikin &lt;yzaikin@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: move binfmt_misc sysctl to its own file</title>
<updated>2022-01-22T06:33:35+00:00</updated>
<author>
<name>Luis Chamberlain</name>
<email>mcgrof@kernel.org</email>
</author>
<published>2022-01-22T06:12:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3ba442d5331ff4d4339faf4986d0841386196f92'/>
<id>3ba442d5331ff4d4339faf4986d0841386196f92</id>
<content type='text'>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

This moves the binfmt_misc sysctl to its own file to help remove clutter
from kernel/sysctl.c.

Link: https://lkml.kernel.org/r/20211124231435.1445213-5-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Amir Goldstein &lt;amir73il@gmail.com&gt;
Cc: Andy Shevchenko &lt;andriy.shevchenko@linux.intel.com&gt;
Cc: Antti Palosaari &lt;crope@iki.fi&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Benjamin LaHaise &lt;bcrl@kvack.org&gt;
Cc: Clemens Ladisch &lt;clemens@ladisch.de&gt;
Cc: David Airlie &lt;airlied@linux.ie&gt;
Cc: Douglas Gilbert &lt;dgilbert@interlog.com&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Iurii Zaikin &lt;yzaikin@google.com&gt;
Cc: James E.J. Bottomley &lt;jejb@linux.ibm.com&gt;
Cc: Jani Nikula &lt;jani.nikula@intel.com&gt;
Cc: Jani Nikula &lt;jani.nikula@linux.intel.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Cc: John Ogness &lt;john.ogness@linutronix.de&gt;
Cc: Joonas Lahtinen &lt;joonas.lahtinen@linux.intel.com&gt;
Cc: Joseph Qi &lt;joseph.qi@linux.alibaba.com&gt;
Cc: Julia Lawall &lt;julia.lawall@inria.fr&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Lukas Middendorf &lt;kernel@tuxforce.de&gt;
Cc: Mark Fasheh &lt;mark@fasheh.com&gt;
Cc: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Cc: Paul Turner &lt;pjt@google.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Phillip Potter &lt;phil@philpotter.co.uk&gt;
Cc: Qing Wang &lt;wangqing@vivo.com&gt;
Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Cc: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
Cc: Sebastian Reichel &lt;sre@kernel.org&gt;
Cc: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Cc: Stephen Kitt &lt;steve@sk2.org&gt;
Cc: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Cc: Xiaoming Ni &lt;nixiaoming@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

This moves the binfmt_misc sysctl to its own file to help remove clutter
from kernel/sysctl.c.

Link: https://lkml.kernel.org/r/20211124231435.1445213-5-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Amir Goldstein &lt;amir73il@gmail.com&gt;
Cc: Andy Shevchenko &lt;andriy.shevchenko@linux.intel.com&gt;
Cc: Antti Palosaari &lt;crope@iki.fi&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Benjamin LaHaise &lt;bcrl@kvack.org&gt;
Cc: Clemens Ladisch &lt;clemens@ladisch.de&gt;
Cc: David Airlie &lt;airlied@linux.ie&gt;
Cc: Douglas Gilbert &lt;dgilbert@interlog.com&gt;
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Iurii Zaikin &lt;yzaikin@google.com&gt;
Cc: James E.J. Bottomley &lt;jejb@linux.ibm.com&gt;
Cc: Jani Nikula &lt;jani.nikula@intel.com&gt;
Cc: Jani Nikula &lt;jani.nikula@linux.intel.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Joel Becker &lt;jlbec@evilplan.org&gt;
Cc: John Ogness &lt;john.ogness@linutronix.de&gt;
Cc: Joonas Lahtinen &lt;joonas.lahtinen@linux.intel.com&gt;
Cc: Joseph Qi &lt;joseph.qi@linux.alibaba.com&gt;
Cc: Julia Lawall &lt;julia.lawall@inria.fr&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Lukas Middendorf &lt;kernel@tuxforce.de&gt;
Cc: Mark Fasheh &lt;mark@fasheh.com&gt;
Cc: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Cc: Paul Turner &lt;pjt@google.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Petr Mladek &lt;pmladek@suse.com&gt;
Cc: Phillip Potter &lt;phil@philpotter.co.uk&gt;
Cc: Qing Wang &lt;wangqing@vivo.com&gt;
Cc: "Rafael J. Wysocki" &lt;rafael@kernel.org&gt;
Cc: Rodrigo Vivi &lt;rodrigo.vivi@intel.com&gt;
Cc: Sebastian Reichel &lt;sre@kernel.org&gt;
Cc: Sergey Senozhatsky &lt;senozhatsky@chromium.org&gt;
Cc: Stephen Kitt &lt;steve@sk2.org&gt;
Cc: Steven Rostedt (VMware) &lt;rostedt@goodmis.org&gt;
Cc: Suren Baghdasaryan &lt;surenb@google.com&gt;
Cc: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Cc: Xiaoming Ni &lt;nixiaoming@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt_misc: fix possible deadlock in bm_register_write</title>
<updated>2021-03-13T19:27:30+00:00</updated>
<author>
<name>Lior Ribak</name>
<email>liorribak@gmail.com</email>
</author>
<published>2021-03-13T05:07:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e7850f4d844e0acfac7e570af611d89deade3146'/>
<id>e7850f4d844e0acfac7e570af611d89deade3146</id>
<content type='text'>
There is a deadlock in bm_register_write:

First, in the begining of the function, a lock is taken on the binfmt_misc
root inode with inode_lock(d_inode(root)).

Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call
open_exec on the user-provided interpreter.

open_exec will call a path lookup, and if the path lookup process includes
the root of binfmt_misc, it will try to take a shared lock on its inode
again, but it is already locked, and the code will get stuck in a deadlock

To reproduce the bug:
$ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" &gt; /proc/sys/fs/binfmt_misc/register

backtrace of where the lock occurs (#5):
0  schedule () at ./arch/x86/include/asm/current.h:15
1  0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=&lt;optimized out&gt;, state=state@entry=2) at kernel/locking/rwsem.c:992
2  0xffffffff81b5150a in __down_read_common (state=2, sem=&lt;optimized out&gt;) at kernel/locking/rwsem.c:1213
3  __down_read (sem=&lt;optimized out&gt;) at kernel/locking/rwsem.c:1222
4  down_read (sem=&lt;optimized out&gt;) at kernel/locking/rwsem.c:1355
5  0xffffffff811ee22a in inode_lock_shared (inode=&lt;optimized out&gt;) at ./include/linux/fs.h:783
6  open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177
7  path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366
8  0xffffffff811efe1c in do_filp_open (dfd=&lt;optimized out&gt;, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396
9  0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=&lt;optimized out&gt;, flags@entry=0) at fs/exec.c:913
10 0xffffffff811e4a92 in open_exec (name=&lt;optimized out&gt;) at fs/exec.c:948
11 0xffffffff8124aa84 in bm_register_write (file=&lt;optimized out&gt;, buffer=&lt;optimized out&gt;, count=19, ppos=&lt;optimized out&gt;) at fs/binfmt_misc.c:682
12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF
", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603
13 0xffffffff811defda in ksys_write (fd=&lt;optimized out&gt;, buf=0xa758d0 ":iiiii:E::ii::i:CF
", count=19) at fs/read_write.c:658
14 0xffffffff81b49813 in do_syscall_64 (nr=&lt;optimized out&gt;, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46
15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

To solve the issue, the open_exec call is moved to before the write
lock is taken by bm_register_write

Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com
Fixes: 948b701a607f1 ("binfmt_misc: add persistent opened binary handler for containers")
Signed-off-by: Lior Ribak &lt;liorribak@gmail.com&gt;
Acked-by: Helge Deller &lt;deller@gmx.de&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There is a deadlock in bm_register_write:

First, in the begining of the function, a lock is taken on the binfmt_misc
root inode with inode_lock(d_inode(root)).

Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call
open_exec on the user-provided interpreter.

open_exec will call a path lookup, and if the path lookup process includes
the root of binfmt_misc, it will try to take a shared lock on its inode
again, but it is already locked, and the code will get stuck in a deadlock

To reproduce the bug:
$ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" &gt; /proc/sys/fs/binfmt_misc/register

backtrace of where the lock occurs (#5):
0  schedule () at ./arch/x86/include/asm/current.h:15
1  0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=&lt;optimized out&gt;, state=state@entry=2) at kernel/locking/rwsem.c:992
2  0xffffffff81b5150a in __down_read_common (state=2, sem=&lt;optimized out&gt;) at kernel/locking/rwsem.c:1213
3  __down_read (sem=&lt;optimized out&gt;) at kernel/locking/rwsem.c:1222
4  down_read (sem=&lt;optimized out&gt;) at kernel/locking/rwsem.c:1355
5  0xffffffff811ee22a in inode_lock_shared (inode=&lt;optimized out&gt;) at ./include/linux/fs.h:783
6  open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177
7  path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366
8  0xffffffff811efe1c in do_filp_open (dfd=&lt;optimized out&gt;, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396
9  0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=&lt;optimized out&gt;, flags@entry=0) at fs/exec.c:913
10 0xffffffff811e4a92 in open_exec (name=&lt;optimized out&gt;) at fs/exec.c:948
11 0xffffffff8124aa84 in bm_register_write (file=&lt;optimized out&gt;, buffer=&lt;optimized out&gt;, count=19, ppos=&lt;optimized out&gt;) at fs/binfmt_misc.c:682
12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF
", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603
13 0xffffffff811defda in ksys_write (fd=&lt;optimized out&gt;, buf=0xa758d0 ":iiiii:E::ii::i:CF
", count=19) at fs/read_write.c:658
14 0xffffffff81b49813 in do_syscall_64 (nr=&lt;optimized out&gt;, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46
15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

To solve the issue, the open_exec call is moved to before the write
lock is taken by bm_register_write

Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com
Fixes: 948b701a607f1 ("binfmt_misc: add persistent opened binary handler for containers")
Signed-off-by: Lior Ribak &lt;liorribak@gmail.com&gt;
Acked-by: Helge Deller &lt;deller@gmx.de&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
