<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/fs/exec.c, branch v6.11</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>exec: Fix ToCToU between perm check and set-uid/gid usage</title>
<updated>2024-08-13T20:24:29+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2024-08-08T18:39:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f50733b45d865f91db90919f8311e2127ce5a0cb'/>
<id>f50733b45d865f91db90919f8311e2127ce5a0cb</id>
<content type='text'>
When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.

For example, if a file could change permissions from executable and not
set-id:

---------x 1 root root 16048 Aug  7 13:16 target

to set-id and non-executable:

---S------ 1 root root 16048 Aug  7 13:16 target

it is possible to gain root privileges when execution should have been
disallowed.

While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:

-rwxr-xr-x 1 root cdrom 16048 Aug  7 13:16 target

becomes:

-rwsr-xr-- 1 root cdrom 16048 Aug  7 13:16 target

But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches brpm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".

Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.

Reported-by: Marco Vanotti &lt;mvanotti@google.com&gt;
Tested-by: Marco Vanotti &lt;mvanotti@google.com&gt;
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: stable@vger.kernel.org
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.

For example, if a file could change permissions from executable and not
set-id:

---------x 1 root root 16048 Aug  7 13:16 target

to set-id and non-executable:

---S------ 1 root root 16048 Aug  7 13:16 target

it is possible to gain root privileges when execution should have been
disallowed.

While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:

-rwxr-xr-x 1 root cdrom 16048 Aug  7 13:16 target

becomes:

-rwsr-xr-- 1 root cdrom 16048 Aug  7 13:16 target

But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches brpm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".

Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.

Reported-by: Marco Vanotti &lt;mvanotti@google.com&gt;
Tested-by: Marco Vanotti &lt;mvanotti@google.com&gt;
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: stable@vger.kernel.org
Cc: Eric Biederman &lt;ebiederm@xmission.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>sysctl: treewide: constify the ctl_table argument of proc_handlers</title>
<updated>2024-07-24T18:59:29+00:00</updated>
<author>
<name>Joel Granados</name>
<email>j.granados@samsung.com</email>
</author>
<published>2024-07-24T18:59:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508'/>
<id>78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508</id>
<content type='text'>
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh &lt;linux@weissschuh.net&gt;
Signed-off-by: Thomas Weißschuh &lt;linux@weissschuh.net&gt;
Co-developed-by: Joel Granados &lt;j.granados@samsung.com&gt;
Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh &lt;linux@weissschuh.net&gt;
Signed-off-by: Thomas Weißschuh &lt;linux@weissschuh.net&gt;
Co-developed-by: Joel Granados &lt;j.granados@samsung.com&gt;
Signed-off-by: Joel Granados &lt;j.granados@samsung.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'execve-v6.11-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux</title>
<updated>2024-07-24T00:30:42+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-07-24T00:30:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e9e969797bae359fd463f7617ad875bca2771586'/>
<id>e9e969797bae359fd463f7617ad875bca2771586</id>
<content type='text'>
Pull execve fix from Kees Cook:
 "This moves the exec and binfmt_elf tests out of your way and into the
  tests/ subdirectory, following the newly ratified KUnit naming
  conventions. :)"

* tag 'execve-v6.11-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  execve: Move KUnit tests to tests/ subdirectory
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull execve fix from Kees Cook:
 "This moves the exec and binfmt_elf tests out of your way and into the
  tests/ subdirectory, following the newly ratified KUnit naming
  conventions. :)"

* tag 'execve-v6.11-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  execve: Move KUnit tests to tests/ subdirectory
</pre>
</div>
</content>
</entry>
<entry>
<title>execve: Move KUnit tests to tests/ subdirectory</title>
<updated>2024-07-23T01:25:47+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2024-07-20T17:03:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=b6f5ee4d53019443fb99dd23bc08680b1244ccfa'/>
<id>b6f5ee4d53019443fb99dd23bc08680b1244ccfa</id>
<content type='text'>
Move the exec KUnit tests into a separate directory to avoid polluting
the local directory namespace. Additionally update MAINTAINERS for the
new files.

Reviewed-by: David Gow &lt;davidgow@google.com&gt;
Reviewed-by: SeongJae Park &lt;sj@kernel.org&gt;
Acked-by: Christian Brauner &lt;brauner@kernel.org&gt;
Link: https://lore.kernel.org/r/20240720170310.it.942-kees@kernel.org
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Move the exec KUnit tests into a separate directory to avoid polluting
the local directory namespace. Additionally update MAINTAINERS for the
new files.

Reviewed-by: David Gow &lt;davidgow@google.com&gt;
Reviewed-by: SeongJae Park &lt;sj@kernel.org&gt;
Acked-by: Christian Brauner &lt;brauner@kernel.org&gt;
Link: https://lore.kernel.org/r/20240720170310.it.942-kees@kernel.org
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'execve-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux</title>
<updated>2024-07-16T19:59:20+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-07-16T19:59:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=72fda6c8e553699f6ba8d3ddc34f0bbe7a5898df'/>
<id>72fda6c8e553699f6ba8d3ddc34f0bbe7a5898df</id>
<content type='text'>
Pull execve updates from Kees Cook:

 - Use value of kernel.randomize_va_space once per exec (Alexey
   Dobriyan)

 - Honor PT_LOAD alignment for static PIE

 - Make bprm-&gt;argmin only visible under CONFIG_MMU

 - Add KUnit testing of bprm_stack_limits()

* tag 'execve-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Avoid pathological argc, envc, and bprm-&gt;p values
  execve: Keep bprm-&gt;argmin behind CONFIG_MMU
  ELF: fix kernel.randomize_va_space double read
  exec: Add KUnit test for bprm_stack_limits()
  binfmt_elf: Honor PT_LOAD alignment for static PIE
  binfmt_elf: Calculate total_size earlier
  selftests/exec: Build both static and non-static load_address tests
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull execve updates from Kees Cook:

 - Use value of kernel.randomize_va_space once per exec (Alexey
   Dobriyan)

 - Honor PT_LOAD alignment for static PIE

 - Make bprm-&gt;argmin only visible under CONFIG_MMU

 - Add KUnit testing of bprm_stack_limits()

* tag 'execve-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Avoid pathological argc, envc, and bprm-&gt;p values
  execve: Keep bprm-&gt;argmin behind CONFIG_MMU
  ELF: fix kernel.randomize_va_space double read
  exec: Add KUnit test for bprm_stack_limits()
  binfmt_elf: Honor PT_LOAD alignment for static PIE
  binfmt_elf: Calculate total_size earlier
  selftests/exec: Build both static and non-static load_address tests
</pre>
</div>
</content>
</entry>
<entry>
<title>exec: Avoid pathological argc, envc, and bprm-&gt;p values</title>
<updated>2024-07-14T04:31:58+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2024-06-21T20:50:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=21f93108306026b8066db31c24a097192c8c36c7'/>
<id>21f93108306026b8066db31c24a097192c8c36c7</id>
<content type='text'>
Make sure nothing goes wrong with the string counters or the bprm's
belief about the stack pointer. Add checks and matching self-tests.

Take special care for !CONFIG_MMU, since argmin is not exposed there.

For 32-bit validation, 32-bit UML was used:
$ tools/testing/kunit/kunit.py run \
	--make_options CROSS_COMPILE=i686-linux-gnu- \
	--make_options SUBARCH=i386 \
	exec

For !MMU validation, m68k was used:
$ tools/testing/kunit/kunit.py run \
	--arch m68k --make_option CROSS_COMPILE=m68k-linux-gnu- \
	exec

Link: https://lore.kernel.org/r/20240520021615.741800-2-keescook@chromium.org
Link: https://lore.kernel.org/r/20240621205046.4001362-2-kees@kernel.org
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Make sure nothing goes wrong with the string counters or the bprm's
belief about the stack pointer. Add checks and matching self-tests.

Take special care for !CONFIG_MMU, since argmin is not exposed there.

For 32-bit validation, 32-bit UML was used:
$ tools/testing/kunit/kunit.py run \
	--make_options CROSS_COMPILE=i686-linux-gnu- \
	--make_options SUBARCH=i386 \
	exec

For !MMU validation, m68k was used:
$ tools/testing/kunit/kunit.py run \
	--arch m68k --make_option CROSS_COMPILE=m68k-linux-gnu- \
	exec

Link: https://lore.kernel.org/r/20240520021615.741800-2-keescook@chromium.org
Link: https://lore.kernel.org/r/20240621205046.4001362-2-kees@kernel.org
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>execve: Keep bprm-&gt;argmin behind CONFIG_MMU</title>
<updated>2024-07-14T04:31:57+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2024-06-21T20:50:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=084ebf7ca83e6cb743784f2eecc654193ce064fb'/>
<id>084ebf7ca83e6cb743784f2eecc654193ce064fb</id>
<content type='text'>
When argmin was added in commit 655c16a8ce9c ("exec: separate
MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for
validating stack limits on CONFIG_MMU[1]. All checking for reaching the
limit (argmin) is wrapped in CONFIG_MMU ifdef checks, though setting
argmin was not. That argmin is only supposed to be used under CONFIG_MMU
was rediscovered recently[2], and I don't want to trip over this again.

Move argmin's declaration into the existing CONFIG_MMU area, and add
helpers functions so the MMU tests can be consolidated.

Link: https://lore.kernel.org/all/20181126122307.GA1660@redhat.com [1]
Link: https://lore.kernel.org/all/202406211253.7037F69@keescook/ [2]
Link: https://lore.kernel.org/r/20240621205046.4001362-1-kees@kernel.org
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When argmin was added in commit 655c16a8ce9c ("exec: separate
MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for
validating stack limits on CONFIG_MMU[1]. All checking for reaching the
limit (argmin) is wrapped in CONFIG_MMU ifdef checks, though setting
argmin was not. That argmin is only supposed to be used under CONFIG_MMU
was rediscovered recently[2], and I don't want to trip over this again.

Move argmin's declaration into the existing CONFIG_MMU area, and add
helpers functions so the MMU tests can be consolidated.

Link: https://lore.kernel.org/all/20181126122307.GA1660@redhat.com [1]
Link: https://lore.kernel.org/all/202406211253.7037F69@keescook/ [2]
Link: https://lore.kernel.org/r/20240621205046.4001362-1-kees@kernel.org
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>exec: Add KUnit test for bprm_stack_limits()</title>
<updated>2024-06-19T20:13:55+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2024-05-20T00:17:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=60371f43e56bfad1f438621fae286e1de5bb6877'/>
<id>60371f43e56bfad1f438621fae286e1de5bb6877</id>
<content type='text'>
Since bprm_stack_limits() operates with very limited side-effects, add
it as the first exec.c KUnit test. Add to Kconfig and adjust MAINTAINERS
file to include it.

Tested on 64-bit UML:
$ tools/testing/kunit/kunit.py run exec

Link: https://lore.kernel.org/lkml/20240520021615.741800-1-keescook@chromium.org/
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since bprm_stack_limits() operates with very limited side-effects, add
it as the first exec.c KUnit test. Add to Kconfig and adjust MAINTAINERS
file to include it.

Tested on 64-bit UML:
$ tools/testing/kunit/kunit.py run exec

Link: https://lore.kernel.org/lkml/20240520021615.741800-1-keescook@chromium.org/
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: don't block i_writecount during exec</title>
<updated>2024-06-03T13:52:10+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2024-05-31T13:01:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2a010c41285345da60cece35575b4e0af7e7bf44'/>
<id>2a010c41285345da60cece35575b4e0af7e7bf44</id>
<content type='text'>
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.

Here are some of the notes that I took:

(1) The deny_write_access() mechanism is causing really pointless issues
    such as [1]. If a thread in a thread-group opens a file writable,
    then writes some stuff, then closing the file descriptor and then
    calling execve() they can fail the execve() with ETXTBUSY because
    another thread in the thread-group could have concurrently called
    fork(). Multi-threaded libraries such as go suffer from this.

(2) There are userspace attacks that rely on overwriting the binary of a
    running process. These attacks are _mitigated_ but _not at all
    prevented_ from ocurring by the deny_write_access() mechanism.

    I'll go over some details. The clearest example of such attacks was
    the attack against runC in CVE-2019-5736 (cf. [3]).

    An attack could compromise the runC host binary from inside a
    _privileged_ runC container. The malicious binary could then be used
    to take over the host.

    (It is crucial to note that this attack is _not_ possible with
     unprivileged containers. IOW, the setup here is already insecure.)

    The attack can be made when attaching to a running container or when
    starting a container running a specially crafted image. For example,
    when runC attaches to a container the attacker can trick it into
    executing itself.

    This could be done by replacing the target binary inside the
    container with a custom binary pointing back at the runC binary
    itself. As an example, if the target binary was /bin/bash, this
    could be replaced with an executable script specifying the
    interpreter path #!/proc/self/exe.

    As such when /bin/bash is executed inside the container, instead the
    target of /proc/self/exe will be executed. That magic link will
    point to the runc binary on the host. The attacker can then proceed
    to write to the target of /proc/self/exe to try and overwrite the
    runC binary on the host.

    However, this will not succeed because of deny_write_access(). Now,
    one might think that this would prevent the attack but it doesn't.

    To overcome this, the attacker has multiple ways:
    * Open a file descriptor to /proc/self/exe using the O_PATH flag and
      then proceed to reopen the binary as O_WRONLY through
      /proc/self/fd/&lt;nr&gt; and try to write to it in a busy loop from a
      separate process. Ultimately it will succeed when the runC binary
      exits. After this the runC binary is compromised and can be used
      to attack other containers or the host itself.
    * Use a malicious shared library annotating a function in there with
      the constructor attribute making the malicious function run as an
      initializor. The malicious library will then open /proc/self/exe
      for creating a new entry under /proc/self/fd/&lt;nr&gt;. It'll then call
      exec to a) force runC to exit and b) hand the file descriptor off
      to a program that then reopens /proc/self/fd/&lt;nr&gt; for writing
      (which is now possible because runC has exited) and overwriting
      that binary.

    To sum up: the deny_write_access() mechanism doesn't prevent such
    attacks in insecure setups. It just makes them minimally harder.
    That's all.

    The only way back then to prevent this is to create a temporary copy
    of the calling binary itself when it starts or attaches to
    containers. So what I did back then for LXC (and Aleksa for runC)
    was to create an anonymous, in-memory file using the memfd_create()
    system call and to copy itself into the temporary in-memory file,
    which is then sealed to prevent further modifications. This sealed,
    in-memory file copy is then executed instead of the original on-disk
    binary.

    Any compromising write operations from a privileged container to the
    host binary will then write to the temporary in-memory binary and
    not to the host binary on-disk, preserving the integrity of the host
    binary. Also as the temporary, in-memory binary is sealed, writes to
    this will also fail.

    The point is that deny_write_access() is uselss to prevent these
    attacks.

(3) Denying write access to an inode because it's currently used in an
    exec path could easily be done on an LSM level. It might need an
    additional hook but that should be about it.

(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
    ago so while we do protect the main executable the bigger portion of
    the things you'd think need protecting such as the shared libraries
    aren't. IOW, we let anyone happily overwrite shared libraries.

(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
    (5.1) We removed the legacy uselib() protection for preventing
          overwriting of shared libraries. Nobody cared in 3 years.
    (5.2) We allow write access to the elf interpreter after exec
          completed treating it on a par with shared libraries.

Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.

Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: https://github.com/golang/go/issues/22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: https://github.com/buildkite/agent/pull/2736
Link: https://github.com/rust-lang/rust/issues/114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: https://github.com/dotnet/runtime/issues/58964
Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Back in 2021 we already discussed removing deny_write_access() for
executables. Back then I was hesistant because I thought that this might
cause issues in userspace. But even back then I had started taking some
notes on what could potentially depend on this and I didn't come up with
a lot so I've changed my mind and I would like to try this.

Here are some of the notes that I took:

(1) The deny_write_access() mechanism is causing really pointless issues
    such as [1]. If a thread in a thread-group opens a file writable,
    then writes some stuff, then closing the file descriptor and then
    calling execve() they can fail the execve() with ETXTBUSY because
    another thread in the thread-group could have concurrently called
    fork(). Multi-threaded libraries such as go suffer from this.

(2) There are userspace attacks that rely on overwriting the binary of a
    running process. These attacks are _mitigated_ but _not at all
    prevented_ from ocurring by the deny_write_access() mechanism.

    I'll go over some details. The clearest example of such attacks was
    the attack against runC in CVE-2019-5736 (cf. [3]).

    An attack could compromise the runC host binary from inside a
    _privileged_ runC container. The malicious binary could then be used
    to take over the host.

    (It is crucial to note that this attack is _not_ possible with
     unprivileged containers. IOW, the setup here is already insecure.)

    The attack can be made when attaching to a running container or when
    starting a container running a specially crafted image. For example,
    when runC attaches to a container the attacker can trick it into
    executing itself.

    This could be done by replacing the target binary inside the
    container with a custom binary pointing back at the runC binary
    itself. As an example, if the target binary was /bin/bash, this
    could be replaced with an executable script specifying the
    interpreter path #!/proc/self/exe.

    As such when /bin/bash is executed inside the container, instead the
    target of /proc/self/exe will be executed. That magic link will
    point to the runc binary on the host. The attacker can then proceed
    to write to the target of /proc/self/exe to try and overwrite the
    runC binary on the host.

    However, this will not succeed because of deny_write_access(). Now,
    one might think that this would prevent the attack but it doesn't.

    To overcome this, the attacker has multiple ways:
    * Open a file descriptor to /proc/self/exe using the O_PATH flag and
      then proceed to reopen the binary as O_WRONLY through
      /proc/self/fd/&lt;nr&gt; and try to write to it in a busy loop from a
      separate process. Ultimately it will succeed when the runC binary
      exits. After this the runC binary is compromised and can be used
      to attack other containers or the host itself.
    * Use a malicious shared library annotating a function in there with
      the constructor attribute making the malicious function run as an
      initializor. The malicious library will then open /proc/self/exe
      for creating a new entry under /proc/self/fd/&lt;nr&gt;. It'll then call
      exec to a) force runC to exit and b) hand the file descriptor off
      to a program that then reopens /proc/self/fd/&lt;nr&gt; for writing
      (which is now possible because runC has exited) and overwriting
      that binary.

    To sum up: the deny_write_access() mechanism doesn't prevent such
    attacks in insecure setups. It just makes them minimally harder.
    That's all.

    The only way back then to prevent this is to create a temporary copy
    of the calling binary itself when it starts or attaches to
    containers. So what I did back then for LXC (and Aleksa for runC)
    was to create an anonymous, in-memory file using the memfd_create()
    system call and to copy itself into the temporary in-memory file,
    which is then sealed to prevent further modifications. This sealed,
    in-memory file copy is then executed instead of the original on-disk
    binary.

    Any compromising write operations from a privileged container to the
    host binary will then write to the temporary in-memory binary and
    not to the host binary on-disk, preserving the integrity of the host
    binary. Also as the temporary, in-memory binary is sealed, writes to
    this will also fail.

    The point is that deny_write_access() is uselss to prevent these
    attacks.

(3) Denying write access to an inode because it's currently used in an
    exec path could easily be done on an LSM level. It might need an
    additional hook but that should be about it.

(4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time
    ago so while we do protect the main executable the bigger portion of
    the things you'd think need protecting such as the shared libraries
    aren't. IOW, we let anyone happily overwrite shared libraries.

(5) We removed all remaining uses of VM_DENYWRITE in [2]. That means:
    (5.1) We removed the legacy uselib() protection for preventing
          overwriting of shared libraries. Nobody cared in 3 years.
    (5.2) We allow write access to the elf interpreter after exec
          completed treating it on a par with shared libraries.

Yes, someone in userspace could potentially be relying on this. It's not
completely out of the realm of possibility but let's find out if that's
actually the case and not guess.

Link: https://github.com/golang/go/issues/22315 [1]
Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2]
Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3]
Link: https://lwn.net/Articles/866493
Link: https://github.com/golang/go/issues/22220
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557
Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61
Link: https://github.com/buildkite/agent/pull/2736
Link: https://github.com/rust-lang/rust/issues/114554
Link: https://bugs.openjdk.org/browse/JDK-8068370
Link: https://github.com/dotnet/runtime/issues/58964
Link: https://lore.kernel.org/r/20240531-vfs-i_writecount-v1-1-a17bea7ee36b@kernel.org
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2024-05-19T16:21:03+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2024-05-19T16:21:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=61307b7be41a1f1039d1d1368810a1d92cb97b44'/>
<id>61307b7be41a1f1039d1d1368810a1d92cb97b44</id>
<content type='text'>
Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM,
  documented (hopefully adequately) in the respective changelogs.
  Notable series include:

   - Lucas Stach has provided some page-mapping cleanup/consolidation/
     maintainability work in the series "mm/treewide: Remove pXd_huge()
     API".

   - In the series "Allow migrate on protnone reference with
     MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
     MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
     one test.

   - In their series "Memory allocation profiling" Kent Overstreet and
     Suren Baghdasaryan have contributed a means of determining (via
     /proc/allocinfo) whereabouts in the kernel memory is being
     allocated: number of calls and amount of memory.

   - Matthew Wilcox has provided the series "Various significant MM
     patches" which does a number of rather unrelated things, but in
     largely similar code sites.

   - In his series "mm: page_alloc: freelist migratetype hygiene"
     Johannes Weiner has fixed the page allocator's handling of
     migratetype requests, with resulting improvements in compaction
     efficiency.

   - In the series "make the hugetlb migration strategy consistent"
     Baolin Wang has fixed a hugetlb migration issue, which should
     improve hugetlb allocation reliability.

   - Liu Shixin has hit an I/O meltdown caused by readahead in a
     memory-tight memcg. Addressed in the series "Fix I/O high when
     memory almost met memcg limit".

   - In the series "mm/filemap: optimize folio adding and splitting"
     Kairui Song has optimized pagecache insertion, yielding ~10%
     performance improvement in one test.

   - Baoquan He has cleaned up and consolidated the early zone
     initialization code in the series "mm/mm_init.c: refactor
     free_area_init_core()".

   - Baoquan has also redone some MM initializatio code in the series
     "mm/init: minor clean up and improvement".

   - MM helper cleanups from Christoph Hellwig in his series "remove
     follow_pfn".

   - More cleanups from Matthew Wilcox in the series "Various
     page-&gt;flags cleanups".

   - Vlastimil Babka has contributed maintainability improvements in the
     series "memcg_kmem hooks refactoring".

   - More folio conversions and cleanups in Matthew Wilcox's series:
	"Convert huge_zero_page to huge_zero_folio"
	"khugepaged folio conversions"
	"Remove page_idle and page_young wrappers"
	"Use folio APIs in procfs"
	"Clean up __folio_put()"
	"Some cleanups for memory-failure"
	"Remove page_mapping()"
	"More folio compat code removal"

   - David Hildenbrand chipped in with "fs/proc/task_mmu: convert
     hugetlb functions to work on folis".

   - Code consolidation and cleanup work related to GUP's handling of
     hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

   - Rick Edgecombe has developed some fixes to stack guard gaps in the
     series "Cover a guard gap corner case".

   - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
     series "mm/ksm: fix ksm exec support for prctl".

   - Baolin Wang has implemented NUMA balancing for multi-size THPs.
     This is a simple first-cut implementation for now. The series is
     "support multi-size THP numa balancing".

   - Cleanups to vma handling helper functions from Matthew Wilcox in
     the series "Unify vma_address and vma_pgoff_address".

   - Some selftests maintenance work from Dev Jain in the series
     "selftests/mm: mremap_test: Optimizations and style fixes".

   - Improvements to the swapping of multi-size THPs from Ryan Roberts
     in the series "Swap-out mTHP without splitting".

   - Kefeng Wang has significantly optimized the handling of arm64's
     permission page faults in the series
	"arch/mm/fault: accelerate pagefault when badaccess"
	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

   - GUP cleanups from David Hildenbrand in "mm/gup: consistently call
     it GUP-fast".

   - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
     path to use struct vm_fault".

   - selftests build fixes from John Hubbard in the series "Fix
     selftests/mm build without requiring "make headers"".

   - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
     series "Improved Memory Tier Creation for CPUless NUMA Nodes".
     Fixes the initialization code so that migration between different
     memory types works as intended.

   - David Hildenbrand has improved follow_pte() and fixed an errant
     driver in the series "mm: follow_pte() improvements and acrn
     follow_pte() fixes".

   - David also did some cleanup work on large folio mapcounts in his
     series "mm: mapcount for large folios + page_mapcount() cleanups".

   - Folio conversions in KSM in Alex Shi's series "transfer page to
     folio in KSM".

   - Barry Song has added some sysfs stats for monitoring multi-size
     THP's in the series "mm: add per-order mTHP alloc and swpout
     counters".

   - Some zswap cleanups from Yosry Ahmed in the series "zswap
     same-filled and limit checking cleanups".

   - Matthew Wilcox has been looking at buffer_head code and found the
     documentation to be lacking. The series is "Improve buffer head
     documentation".

   - Multi-size THPs get more work, this time from Lance Yang. His
     series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
     optimizes the freeing of these things.

   - Kemeng Shi has added more userspace-visible writeback
     instrumentation in the series "Improve visibility of writeback".

   - Kemeng Shi then sent some maintenance work on top in the series
     "Fix and cleanups to page-writeback".

   - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
     the series "Improve anon_vma scalability for anon VMAs". Intel's
     test bot reported an improbable 3x improvement in one test.

   - SeongJae Park adds some DAMON feature work in the series
	"mm/damon: add a DAMOS filter type for page granularity access recheck"
	"selftests/damon: add DAMOS quota goal test"

   - Also some maintenance work in the series
	"mm/damon/paddr: simplify page level access re-check for pageout"
	"mm/damon: misc fixes and improvements"

   - David Hildenbrand has disabled some known-to-fail selftests ni the
     series "selftests: mm: cow: flag vmsplice() hugetlb tests as
     XFAIL".

   - memcg metadata storage optimizations from Shakeel Butt in "memcg:
     reduce memory consumption by memcg stats".

   - DAX fixes and maintenance work from Vishal Verma in the series
     "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize -&gt;esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull mm updates from Andrew Morton:
 "The usual shower of singleton fixes and minor series all over MM,
  documented (hopefully adequately) in the respective changelogs.
  Notable series include:

   - Lucas Stach has provided some page-mapping cleanup/consolidation/
     maintainability work in the series "mm/treewide: Remove pXd_huge()
     API".

   - In the series "Allow migrate on protnone reference with
     MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
     MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
     one test.

   - In their series "Memory allocation profiling" Kent Overstreet and
     Suren Baghdasaryan have contributed a means of determining (via
     /proc/allocinfo) whereabouts in the kernel memory is being
     allocated: number of calls and amount of memory.

   - Matthew Wilcox has provided the series "Various significant MM
     patches" which does a number of rather unrelated things, but in
     largely similar code sites.

   - In his series "mm: page_alloc: freelist migratetype hygiene"
     Johannes Weiner has fixed the page allocator's handling of
     migratetype requests, with resulting improvements in compaction
     efficiency.

   - In the series "make the hugetlb migration strategy consistent"
     Baolin Wang has fixed a hugetlb migration issue, which should
     improve hugetlb allocation reliability.

   - Liu Shixin has hit an I/O meltdown caused by readahead in a
     memory-tight memcg. Addressed in the series "Fix I/O high when
     memory almost met memcg limit".

   - In the series "mm/filemap: optimize folio adding and splitting"
     Kairui Song has optimized pagecache insertion, yielding ~10%
     performance improvement in one test.

   - Baoquan He has cleaned up and consolidated the early zone
     initialization code in the series "mm/mm_init.c: refactor
     free_area_init_core()".

   - Baoquan has also redone some MM initializatio code in the series
     "mm/init: minor clean up and improvement".

   - MM helper cleanups from Christoph Hellwig in his series "remove
     follow_pfn".

   - More cleanups from Matthew Wilcox in the series "Various
     page-&gt;flags cleanups".

   - Vlastimil Babka has contributed maintainability improvements in the
     series "memcg_kmem hooks refactoring".

   - More folio conversions and cleanups in Matthew Wilcox's series:
	"Convert huge_zero_page to huge_zero_folio"
	"khugepaged folio conversions"
	"Remove page_idle and page_young wrappers"
	"Use folio APIs in procfs"
	"Clean up __folio_put()"
	"Some cleanups for memory-failure"
	"Remove page_mapping()"
	"More folio compat code removal"

   - David Hildenbrand chipped in with "fs/proc/task_mmu: convert
     hugetlb functions to work on folis".

   - Code consolidation and cleanup work related to GUP's handling of
     hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

   - Rick Edgecombe has developed some fixes to stack guard gaps in the
     series "Cover a guard gap corner case".

   - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
     series "mm/ksm: fix ksm exec support for prctl".

   - Baolin Wang has implemented NUMA balancing for multi-size THPs.
     This is a simple first-cut implementation for now. The series is
     "support multi-size THP numa balancing".

   - Cleanups to vma handling helper functions from Matthew Wilcox in
     the series "Unify vma_address and vma_pgoff_address".

   - Some selftests maintenance work from Dev Jain in the series
     "selftests/mm: mremap_test: Optimizations and style fixes".

   - Improvements to the swapping of multi-size THPs from Ryan Roberts
     in the series "Swap-out mTHP without splitting".

   - Kefeng Wang has significantly optimized the handling of arm64's
     permission page faults in the series
	"arch/mm/fault: accelerate pagefault when badaccess"
	"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

   - GUP cleanups from David Hildenbrand in "mm/gup: consistently call
     it GUP-fast".

   - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
     path to use struct vm_fault".

   - selftests build fixes from John Hubbard in the series "Fix
     selftests/mm build without requiring "make headers"".

   - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
     series "Improved Memory Tier Creation for CPUless NUMA Nodes".
     Fixes the initialization code so that migration between different
     memory types works as intended.

   - David Hildenbrand has improved follow_pte() and fixed an errant
     driver in the series "mm: follow_pte() improvements and acrn
     follow_pte() fixes".

   - David also did some cleanup work on large folio mapcounts in his
     series "mm: mapcount for large folios + page_mapcount() cleanups".

   - Folio conversions in KSM in Alex Shi's series "transfer page to
     folio in KSM".

   - Barry Song has added some sysfs stats for monitoring multi-size
     THP's in the series "mm: add per-order mTHP alloc and swpout
     counters".

   - Some zswap cleanups from Yosry Ahmed in the series "zswap
     same-filled and limit checking cleanups".

   - Matthew Wilcox has been looking at buffer_head code and found the
     documentation to be lacking. The series is "Improve buffer head
     documentation".

   - Multi-size THPs get more work, this time from Lance Yang. His
     series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
     optimizes the freeing of these things.

   - Kemeng Shi has added more userspace-visible writeback
     instrumentation in the series "Improve visibility of writeback".

   - Kemeng Shi then sent some maintenance work on top in the series
     "Fix and cleanups to page-writeback".

   - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
     the series "Improve anon_vma scalability for anon VMAs". Intel's
     test bot reported an improbable 3x improvement in one test.

   - SeongJae Park adds some DAMON feature work in the series
	"mm/damon: add a DAMOS filter type for page granularity access recheck"
	"selftests/damon: add DAMOS quota goal test"

   - Also some maintenance work in the series
	"mm/damon/paddr: simplify page level access re-check for pageout"
	"mm/damon: misc fixes and improvements"

   - David Hildenbrand has disabled some known-to-fail selftests ni the
     series "selftests: mm: cow: flag vmsplice() hugetlb tests as
     XFAIL".

   - memcg metadata storage optimizations from Shakeel Butt in "memcg:
     reduce memory consumption by memcg stats".

   - DAX fixes and maintenance work from Vishal Verma in the series
     "dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
  memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
  selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
  mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
  selftests: cgroup: add tests to verify the zswap writeback path
  mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
  mm/damon/core: fix return value from damos_wmark_metric_value
  mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
  selftests: cgroup: remove redundant enabling of memory controller
  Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
  Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
  Docs/mm/damon/design: use a list for supported filters
  Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
  Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
  selftests/damon: classify tests for functionalities and regressions
  selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
  selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
  selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
  mm/damon/core: initialize -&gt;esz_bp from damos_quota_init_priv()
  selftests/damon: add a test for DAMOS quota goal
  ...
</pre>
</div>
</content>
</entry>
</feed>
