<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/mm/memfd.c, branch v6.5</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>memfd: check for non-NULL file_seals in memfd_create() syscall</title>
<updated>2023-06-19T20:19:31+00:00</updated>
<author>
<name>Roberto Sassu</name>
<email>roberto.sassu@huawei.com</email>
</author>
<published>2023-06-07T13:24:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=935d44acf621aa0688fef8312dec3e5940f38f4e'/>
<id>935d44acf621aa0688fef8312dec3e5940f38f4e</id>
<content type='text'>
Ensure that file_seals is non-NULL before using it in the memfd_create()
syscall.  One situation in which memfd_file_seals_ptr() could return a
NULL pointer when CONFIG_SHMEM=n, oopsing the kernel.

Link: https://lkml.kernel.org/r/20230607132427.2867435-1-roberto.sassu@huaweicloud.com
Fixes: 47b9012ecdc7 ("shmem: add sealing support to hugetlb-backed memfd")
Signed-off-by: Roberto Sassu &lt;roberto.sassu@huawei.com&gt;
Cc: Marc-Andr Lureau &lt;marcandre.lureau@redhat.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Ensure that file_seals is non-NULL before using it in the memfd_create()
syscall.  One situation in which memfd_file_seals_ptr() could return a
NULL pointer when CONFIG_SHMEM=n, oopsing the kernel.

Link: https://lkml.kernel.org/r/20230607132427.2867435-1-roberto.sassu@huaweicloud.com
Fixes: 47b9012ecdc7 ("shmem: add sealing support to hugetlb-backed memfd")
Signed-off-by: Roberto Sassu &lt;roberto.sassu@huawei.com&gt;
Cc: Marc-Andr Lureau &lt;marcandre.lureau@redhat.com&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>memfd: pass argument of memfd_fcntl as int</title>
<updated>2023-04-18T23:30:11+00:00</updated>
<author>
<name>Luca Vizzarro</name>
<email>Luca.Vizzarro@arm.com</email>
</author>
<published>2023-04-14T15:24:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f7b8f70ba44fb63df47b6fc3204d49ac2885de64'/>
<id>f7b8f70ba44fb63df47b6fc3204d49ac2885de64</id>
<content type='text'>
The interface for fcntl expects the argument passed for the command
F_ADD_SEALS to be of type int.  The current code wrongly treats it as a
long.  In order to avoid access to undefined bits, we should explicitly
cast the argument to int.

This commit changes the signature of all the related and helper functions
so that they treat the argument as int instead of long.

Link: https://lkml.kernel.org/r/20230414152459.816046-5-Luca.Vizzarro@arm.com
Signed-off-by: Luca Vizzarro &lt;Luca.Vizzarro@arm.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Jeff Layton &lt;jlayton@kernel.org&gt;
Cc: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Cc: Kevin Brodsky &lt;Kevin.Brodsky@arm.com&gt;
Cc: Vincenzo Frascino &lt;Vincenzo.Frascino@arm.com&gt;
Cc: Szabolcs Nagy &lt;Szabolcs.Nagy@arm.com&gt;
Cc: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Cc: David Laight &lt;David.Laight@ACULAB.com&gt;
Cc: Mark Rutland &lt;Mark.Rutland@arm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The interface for fcntl expects the argument passed for the command
F_ADD_SEALS to be of type int.  The current code wrongly treats it as a
long.  In order to avoid access to undefined bits, we should explicitly
cast the argument to int.

This commit changes the signature of all the related and helper functions
so that they treat the argument as int instead of long.

Link: https://lkml.kernel.org/r/20230414152459.816046-5-Luca.Vizzarro@arm.com
Signed-off-by: Luca Vizzarro &lt;Luca.Vizzarro@arm.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Jeff Layton &lt;jlayton@kernel.org&gt;
Cc: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Cc: Kevin Brodsky &lt;Kevin.Brodsky@arm.com&gt;
Cc: Vincenzo Frascino &lt;Vincenzo.Frascino@arm.com&gt;
Cc: Szabolcs Nagy &lt;Szabolcs.Nagy@arm.com&gt;
Cc: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Cc: David Laight &lt;David.Laight@ACULAB.com&gt;
Cc: Mark Rutland &lt;Mark.Rutland@arm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memfd: add write seals when apply SEAL_EXEC to executable memfd</title>
<updated>2023-01-19T01:12:37+00:00</updated>
<author>
<name>Jeff Xu</name>
<email>jeffxu@google.com</email>
</author>
<published>2022-12-15T00:12:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c4f75bc8bd6b3d62665e1f5400c419540edb5601'/>
<id>c4f75bc8bd6b3d62665e1f5400c419540edb5601</id>
<content type='text'>
In order to avoid WX mappings, add F_SEAL_WRITE when apply F_SEAL_EXEC to
an executable memfd, so W^X from start.

This implys application need to fill the content of the memfd first, after
F_SEAL_EXEC is applied, application can no longer modify the content of
the memfd.

Typically, application seals the memfd right after writing to it.
For example:
1. memfd_create(MFD_EXEC).
2. write() code to the memfd.
3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X.
4. call exec() on the memfd.

Link: https://lkml.kernel.org/r/20221215001205.51969-5-jeffxu@google.com
Signed-off-by: Jeff Xu &lt;jeffxu@google.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: Dmitry Torokhov &lt;dmitry.torokhov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jorge Lucangeli Obes &lt;jorgelo@chromium.org&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In order to avoid WX mappings, add F_SEAL_WRITE when apply F_SEAL_EXEC to
an executable memfd, so W^X from start.

This implys application need to fill the content of the memfd first, after
F_SEAL_EXEC is applied, application can no longer modify the content of
the memfd.

Typically, application seals the memfd right after writing to it.
For example:
1. memfd_create(MFD_EXEC).
2. write() code to the memfd.
3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X.
4. call exec() on the memfd.

Link: https://lkml.kernel.org/r/20221215001205.51969-5-jeffxu@google.com
Signed-off-by: Jeff Xu &lt;jeffxu@google.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: Dmitry Torokhov &lt;dmitry.torokhov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jorge Lucangeli Obes &lt;jorgelo@chromium.org&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC</title>
<updated>2023-01-19T01:12:37+00:00</updated>
<author>
<name>Jeff Xu</name>
<email>jeffxu@google.com</email>
</author>
<published>2022-12-15T00:12:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=105ff5339f498af74e60d7662c8f1c4d21f1342d'/>
<id>105ff5339f498af74e60d7662c8f1c4d21f1342d</id>
<content type='text'>
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set
executable bit at creation time (memfd_create).

When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be
executable (mode: 0777) after creation.

when MFD_EXEC flag is set, memfd is created with executable bit
(mode:0777), this is the same as the old behavior of memfd_create.

The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
        MFD_EXEC was set.
1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
        MFD_NOEXEC_SEAL was set.
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create for old-software that
doesn't set the executable bit, for example, a container with
vm.memfd_noexec=1 means the old-software will create non-executable memfd
by default.  Also, the value of memfd_noexec is passed to child namespace
at creation time.  For example, if the init namespace has
vm.memfd_noexec=2, all its children namespaces will be created with 2.

[akpm@linux-foundation.org: add stub functions to fix build]
[akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
[akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review]
[akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning]
Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com
Signed-off-by: Jeff Xu &lt;jeffxu@google.com&gt;
Co-developed-by: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Signed-off-by: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: Dmitry Torokhov &lt;dmitry.torokhov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jorge Lucangeli Obes &lt;jorgelo@chromium.org&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to set
executable bit at creation time (memfd_create).

When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to be
executable (mode: 0777) after creation.

when MFD_EXEC flag is set, memfd is created with executable bit
(mode:0777), this is the same as the old behavior of memfd_create.

The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
        MFD_EXEC was set.
1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
        MFD_NOEXEC_SEAL was set.
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create for old-software that
doesn't set the executable bit, for example, a container with
vm.memfd_noexec=1 means the old-software will create non-executable memfd
by default.  Also, the value of memfd_noexec is passed to child namespace
at creation time.  For example, if the init namespace has
vm.memfd_noexec=2, all its children namespaces will be created with 2.

[akpm@linux-foundation.org: add stub functions to fix build]
[akpm@linux-foundation.org: remove unneeded register_pid_ns_ctl_table_vm() stub, per Jeff]
[akpm@linux-foundation.org: s/pr_warn_ratelimited/pr_warn_once/, per review]
[akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warning]
Link: https://lkml.kernel.org/r/20221215001205.51969-4-jeffxu@google.com
Signed-off-by: Jeff Xu &lt;jeffxu@google.com&gt;
Co-developed-by: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Signed-off-by: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: Dmitry Torokhov &lt;dmitry.torokhov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jorge Lucangeli Obes &lt;jorgelo@chromium.org&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/memfd: add F_SEAL_EXEC</title>
<updated>2023-01-19T01:12:37+00:00</updated>
<author>
<name>Daniel Verkamp</name>
<email>dverkamp@chromium.org</email>
</author>
<published>2022-12-15T00:12:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6fd7353829cafc4067aad9eea0dc95da67e7df16'/>
<id>6fd7353829cafc4067aad9eea0dc95da67e7df16</id>
<content type='text'>
Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8.

Since Linux introduced the memfd feature, memfd have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting it
differently.

However, in a secure by default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass and
enables “confused deputy attack”.  E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code and
root escalation.  [2] lists more VRP in this kind.

On the other hand, executable memfd has its legit use, runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them, for such system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].

To address those above, this set of patches add following:
1&gt; Let memfd_create() set X bit at creation time.
2&gt; Let memfd to be sealed for modifying X bit.
3&gt; A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
   X bit.For example, if a container has vm.memfd_noexec=2, then
   memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4&gt; A new security hook in memfd_create(). This make it possible to a new
   LSM, which rejects or allows executable memfd based on its security policy.


This patch (of 5):

The new F_SEAL_EXEC flag will prevent modification of the exec bits:
written as traditional octal mask, 0111, or as named flags, S_IXUSR |
S_IXGRP | S_IXOTH.  Any chmod(2) or similar call that attempts to modify
any of these bits after the seal is applied will fail with errno EPERM.

This will preserve the execute bits as they are at the time of sealing, so
the memfd will become either permanently executable or permanently
un-executable.

Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com
Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com
Signed-off-by: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Co-developed-by: Jeff Xu &lt;jeffxu@google.com&gt;
Signed-off-by: Jeff Xu &lt;jeffxu@google.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Dmitry Torokhov &lt;dmitry.torokhov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jorge Lucangeli Obes &lt;jorgelo@chromium.org&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8.

Since Linux introduced the memfd feature, memfd have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting it
differently.

However, in a secure by default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass and
enables “confused deputy attack”.  E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code and
root escalation.  [2] lists more VRP in this kind.

On the other hand, executable memfd has its legit use, runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them, for such system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].

To address those above, this set of patches add following:
1&gt; Let memfd_create() set X bit at creation time.
2&gt; Let memfd to be sealed for modifying X bit.
3&gt; A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
   X bit.For example, if a container has vm.memfd_noexec=2, then
   memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4&gt; A new security hook in memfd_create(). This make it possible to a new
   LSM, which rejects or allows executable memfd based on its security policy.


This patch (of 5):

The new F_SEAL_EXEC flag will prevent modification of the exec bits:
written as traditional octal mask, 0111, or as named flags, S_IXUSR |
S_IXGRP | S_IXOTH.  Any chmod(2) or similar call that attempts to modify
any of these bits after the seal is applied will fail with errno EPERM.

This will preserve the execute bits as they are at the time of sealing, so
the memfd will become either permanently executable or permanently
un-executable.

Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com
Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com
Signed-off-by: Daniel Verkamp &lt;dverkamp@chromium.org&gt;
Co-developed-by: Jeff Xu &lt;jeffxu@google.com&gt;
Signed-off-by: Jeff Xu &lt;jeffxu@google.com&gt;
Reviewed-by: Kees Cook &lt;keescook@chromium.org&gt;
Cc: Dmitry Torokhov &lt;dmitry.torokhov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jorge Lucangeli Obes &lt;jorgelo@chromium.org&gt;
Cc: Shuah Khan &lt;skhan@linuxfoundation.org&gt;
Cc: David Herrmann &lt;dh.herrmann@gmail.com&gt;
Cc: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>memfd: fix F_SEAL_WRITE after shmem huge page allocated</title>
<updated>2022-03-05T19:08:32+00:00</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2022-03-05T04:29:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f2b277c4d1c63a85127e8aa2588e9cc3bd21cb99'/>
<id>f2b277c4d1c63a85127e8aa2588e9cc3bd21cb99</id>
<content type='text'>
Wangyong reports: after enabling tmpfs filesystem to support transparent
hugepage with the following command:

  echo always &gt; /sys/kernel/mm/transparent_hugepage/shmem_enabled

the docker program tries to add F_SEAL_WRITE through the following
command, but it fails unexpectedly with errno EBUSY:

  fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.

That is because memfd_tag_pins() and memfd_wait_for_pins() were never
updated for shmem huge pages: checking page_mapcount() against
page_count() is hopeless on THP subpages - they need to check
total_mapcount() against page_count() on THP heads only.

Make memfd_tag_pins() (compared &gt; 1) as strict as memfd_wait_for_pins()
(compared != 1): either can be justified, but given the non-atomic
total_mapcount() calculation, it is better now to be strict.  Bear in
mind that total_mapcount() itself scans all of the THP subpages, when
choosing to take an XA_CHECK_SCHED latency break.

Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
page has been swapped out since memfd_tag_pins(), then its refcount must
have fallen, and so it can safely be untagged.

Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Reported-by: Zeal Robot &lt;zealci@zte.com.cn&gt;
Reported-by: wangyong &lt;wang.yong12@zte.com.cn&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: CGEL ZTE &lt;cgel.zte@gmail.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Song Liu &lt;songliubraving@fb.com&gt;
Cc: Yang Yang &lt;yang.yang29@zte.com.cn&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Wangyong reports: after enabling tmpfs filesystem to support transparent
hugepage with the following command:

  echo always &gt; /sys/kernel/mm/transparent_hugepage/shmem_enabled

the docker program tries to add F_SEAL_WRITE through the following
command, but it fails unexpectedly with errno EBUSY:

  fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.

That is because memfd_tag_pins() and memfd_wait_for_pins() were never
updated for shmem huge pages: checking page_mapcount() against
page_count() is hopeless on THP subpages - they need to check
total_mapcount() against page_count() on THP heads only.

Make memfd_tag_pins() (compared &gt; 1) as strict as memfd_wait_for_pins()
(compared != 1): either can be justified, but given the non-atomic
total_mapcount() calculation, it is better now to be strict.  Bear in
mind that total_mapcount() itself scans all of the THP subpages, when
choosing to take an XA_CHECK_SCHED latency break.

Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
page has been swapped out since memfd_tag_pins(), then its refcount must
have fallen, and so it can safely be untagged.

Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Reported-by: Zeal Robot &lt;zealci@zte.com.cn&gt;
Reported-by: wangyong &lt;wang.yong12@zte.com.cn&gt;
Cc: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: CGEL ZTE &lt;cgel.zte@gmail.com&gt;
Cc: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Song Liu &lt;songliubraving@fb.com&gt;
Cc: Yang Yang &lt;yang.yang29@zte.com.cn&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm,hugetlb: remove mlock ulimit for SHM_HUGETLB</title>
<updated>2021-11-09T18:02:48+00:00</updated>
<author>
<name>zhangyiru</name>
<email>zhangyiru3@huawei.com</email>
</author>
<published>2021-11-09T02:31:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=83c1fd763b3220458652f7d656d5047292ebcde4'/>
<id>83c1fd763b3220458652f7d656d5047292ebcde4</id>
<content type='text'>
Commit 21a3c273f88c ("mm, hugetlb: add thread name and pid to
SHM_HUGETLB mlock rlimit warning") marked this as deprecated in 2012,
but it is not deleted yet.

Mike says he still sees that message in log files on occasion, so maybe we
should preserve this warning.

Also remove hugetlbfs related user_shm_unlock in ipc/shm.c and remove the
user_shm_unlock after out.

Link: https://lkml.kernel.org/r/20211103105857.25041-1-zhangyiru3@huawei.com
Signed-off-by: zhangyiru &lt;zhangyiru3@huawei.com&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Liu Zixian &lt;liuzixian4@huawei.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: wuxu.wu &lt;wuxu.wu@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 21a3c273f88c ("mm, hugetlb: add thread name and pid to
SHM_HUGETLB mlock rlimit warning") marked this as deprecated in 2012,
but it is not deleted yet.

Mike says he still sees that message in log files on occasion, so maybe we
should preserve this warning.

Also remove hugetlbfs related user_shm_unlock in ipc/shm.c and remove the
user_shm_unlock after out.

Link: https://lkml.kernel.org/r/20211103105857.25041-1-zhangyiru3@huawei.com
Signed-off-by: zhangyiru &lt;zhangyiru3@huawei.com&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Liu Zixian &lt;liuzixian4@huawei.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: wuxu.wu &lt;wuxu.wu@huawei.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Reimplement RLIMIT_MEMLOCK on top of ucounts</title>
<updated>2021-04-30T19:14:02+00:00</updated>
<author>
<name>Alexey Gladkov</name>
<email>legion@kernel.org</email>
</author>
<published>2021-04-22T12:27:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d7c9e99aee48e6bc0b427f3e3c658a6aba15001e'/>
<id>d7c9e99aee48e6bc0b427f3e3c658a6aba15001e</id>
<content type='text'>
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Fix issue found by lkp robot.

v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Alexey Gladkov &lt;legion@kernel.org&gt;
Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The rlimit counter is tied to uid in the user_namespace. This allows
rlimit values to be specified in userns even if they are already
globally exceeded by the user. However, the value of the previous
user_namespaces cannot be exceeded.

Changelog

v11:
* Fix issue found by lkp robot.

v8:
* Fix issues found by lkp-tests project.

v7:
* Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred.

v6:
* Fix bug in hugetlb_file_setup() detected by trinity.

Reported-by: kernel test robot &lt;oliver.sang@intel.com&gt;
Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Alexey Gladkov &lt;legion@kernel.org&gt;
Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: page cache: store only head pages in i_pages</title>
<updated>2019-09-24T22:54:08+00:00</updated>
<author>
<name>Matthew Wilcox (Oracle)</name>
<email>willy@infradead.org</email>
</author>
<published>2019-09-23T22:34:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4101196b19d7f905dca5dcf46cd35eb758cf06c0'/>
<id>4101196b19d7f905dca5dcf46cd35eb758cf06c0</id>
<content type='text'>
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages.  This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

Kirill and Huang Ying contributed several fixes.

[willy@infradead.org: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox &lt;willy@infradead.org&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Kirill Shutemov &lt;kirill@shutemov.name&gt;
Reviewed-by: Song Liu &lt;songliubraving@fb.com&gt;
Tested-by: Song Liu &lt;songliubraving@fb.com&gt;
Tested-by: William Kucharski &lt;william.kucharski@oracle.com&gt;
Reviewed-by: William Kucharski &lt;william.kucharski@oracle.com&gt;
Tested-by: Qian Cai &lt;cai@lca.pw&gt;
Tested-by: Mikhail Gavrilov &lt;mikhail.v.gavrilov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Cc: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Transparent Huge Pages are currently stored in i_pages as pointers to
consecutive subpages.  This patch changes that to storing consecutive
pointers to the head page in preparation for storing huge pages more
efficiently in i_pages.

Large parts of this are "inspired" by Kirill's patch
https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

Kirill and Huang Ying contributed several fixes.

[willy@infradead.org: use compound_nr, squish uninit-var warning]
Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
Signed-off-by: Matthew Wilcox &lt;willy@infradead.org&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Reviewed-by: Kirill Shutemov &lt;kirill@shutemov.name&gt;
Reviewed-by: Song Liu &lt;songliubraving@fb.com&gt;
Tested-by: Song Liu &lt;songliubraving@fb.com&gt;
Tested-by: William Kucharski &lt;william.kucharski@oracle.com&gt;
Reviewed-by: William Kucharski &lt;william.kucharski@oracle.com&gt;
Tested-by: Qian Cai &lt;cai@lca.pw&gt;
Tested-by: Mikhail Gavrilov &lt;mikhail.v.gavrilov@gmail.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Cc: Song Liu &lt;songliubraving@fb.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Revert "mm: page cache: store only head pages in i_pages"</title>
<updated>2019-07-06T02:55:18+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-07-06T02:55:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=69bf4b6b54fb7f52b7ea9ce28d4a360cd5ec956d'/>
<id>69bf4b6b54fb7f52b7ea9ce28d4a360cd5ec956d</id>
<content type='text'>
This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
__delete_from_swap_cache() to trigger:

   page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
   anon
   flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
   raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
   raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
   page dumped because: VM_BUG_ON_PAGE(entry != page)
   page-&gt;mem_cgroup:ffff978433ace000
   ------------[ cut here ]------------
   kernel BUG at mm/swap_state.c:170!
   invalid opcode: 0000 [#1] SMP NOPTI
   CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
   Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
   RIP: 0010:__delete_from_swap_cache+0x20d/0x240
   Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff &lt;0f&gt; 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
   RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
   RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
   RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
   RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
   R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
   R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
   FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
   Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

and it's not immediately obvious why it happens.  It's too late in the
rc cycle to do anything but revert for now.

Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
Reported-and-bisected-by: Mikhail Gavrilov &lt;mikhail.v.gavrilov@gmail.com&gt;
Suggested-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Kirill Shutemov &lt;kirill@shutemov.name&gt;
Cc: William Kucharski &lt;william.kucharski@oracle.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
__delete_from_swap_cache() to trigger:

   page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
   anon
   flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
   raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
   raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
   page dumped because: VM_BUG_ON_PAGE(entry != page)
   page-&gt;mem_cgroup:ffff978433ace000
   ------------[ cut here ]------------
   kernel BUG at mm/swap_state.c:170!
   invalid opcode: 0000 [#1] SMP NOPTI
   CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
   Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
   RIP: 0010:__delete_from_swap_cache+0x20d/0x240
   Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff &lt;0f&gt; 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
   RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
   RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
   RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
   RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
   R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
   R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
   FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
   Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

and it's not immediately obvious why it happens.  It's too late in the
rc cycle to do anything but revert for now.

Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
Reported-and-bisected-by: Mikhail Gavrilov &lt;mikhail.v.gavrilov@gmail.com&gt;
Suggested-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: Kirill Shutemov &lt;kirill@shutemov.name&gt;
Cc: William Kucharski &lt;william.kucharski@oracle.com&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
