<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/arch/um/kernel/trap.c, branch v7.0</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>um: Add initial SMP support</title>
<updated>2025-10-27T15:41:15+00:00</updated>
<author>
<name>Tiwei Bie</name>
<email>tiwei.btw@antgroup.com</email>
</author>
<published>2025-10-27T00:18:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=1e4ee5135d814fe4785890790cec81c3132888fb'/>
<id>1e4ee5135d814fe4785890790cec81c3132888fb</id>
<content type='text'>
Add initial symmetric multi-processing (SMP) support to UML. With
this support enabled, users can tell UML to start multiple virtual
processors, each represented as a separate host thread.

In UML, kthreads and normal threads (when running in kernel mode)
can be scheduled and executed simultaneously on different virtual
processors. However, the userspace code of normal threads still
runs within their respective single-threaded stubs.

That is, SMP support is currently available both within the kernel
and across different processes, but still remains limited within
threads of the same process in userspace.

Signed-off-by: Tiwei Bie &lt;tiwei.btw@antgroup.com&gt;
Link: https://patch.msgid.link/20251027001815.1666872-6-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add initial symmetric multi-processing (SMP) support to UML. With
this support enabled, users can tell UML to start multiple virtual
processors, each represented as a separate host thread.

In UML, kthreads and normal threads (when running in kernel mode)
can be scheduled and executed simultaneously on different virtual
processors. However, the userspace code of normal threads still
runs within their respective single-threaded stubs.

That is, SMP support is currently available both within the kernel
and across different processes, but still remains limited within
threads of the same process in userspace.

Signed-off-by: Tiwei Bie &lt;tiwei.btw@antgroup.com&gt;
Link: https://patch.msgid.link/20251027001815.1666872-6-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'uml-for-linux-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux</title>
<updated>2025-06-05T18:45:33+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-06-05T18:45:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=cfc4ca8986bb1f6182da6cd7bb57f228590b4643'/>
<id>cfc4ca8986bb1f6182da6cd7bb57f228590b4643</id>
<content type='text'>
Pull UML updates from Johannes Berg:
 "The only really new thing is the long-standing seccomp work
  (originally from 2021!). Wven if it still isn't enabled by default due
  to security concerns it can still be used e.g. for tests.

   - remove obsolete network transports

   - remove PCI IO port support

   - start adding seccomp-based process handling instead of ptrace"

* tag 'uml-for-linux-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (29 commits)
  um: remove "extern" from implementation of sigchld_handler
  um: fix unused variable warning
  um: fix SECCOMP 32bit xstate register restore
  um: pass FD for memory operations when needed
  um: Add SECCOMP support detection and initialization
  um: Implement kernel side of SECCOMP based process handling
  um: Track userspace children dying in SECCOMP mode
  um: Add helper functions to get/set state for SECCOMP
  um: Add stub side of SECCOMP/futex based process handling
  um: Move faultinfo extraction into userspace routine
  um: vector: Use mac_pton() for MAC address parsing
  um: vector: Clean up and modernize log messages
  um: chan_kern: use raw spinlock for irqs_to_free_lock
  MAINTAINERS: remove obsolete file entry in TUN/TAP DRIVER
  um: Fix tgkill compile error on old host OSes
  um: stop using PCI port I/O
  um: Remove legacy network transport infrastructure
  um: vector: Eliminate the dependency on uml_net
  um: Remove obsolete legacy network transports
  um/asm: Replace "REP; NOP" with PAUSE mnemonic
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull UML updates from Johannes Berg:
 "The only really new thing is the long-standing seccomp work
  (originally from 2021!). Wven if it still isn't enabled by default due
  to security concerns it can still be used e.g. for tests.

   - remove obsolete network transports

   - remove PCI IO port support

   - start adding seccomp-based process handling instead of ptrace"

* tag 'uml-for-linux-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (29 commits)
  um: remove "extern" from implementation of sigchld_handler
  um: fix unused variable warning
  um: fix SECCOMP 32bit xstate register restore
  um: pass FD for memory operations when needed
  um: Add SECCOMP support detection and initialization
  um: Implement kernel side of SECCOMP based process handling
  um: Track userspace children dying in SECCOMP mode
  um: Add helper functions to get/set state for SECCOMP
  um: Add stub side of SECCOMP/futex based process handling
  um: Move faultinfo extraction into userspace routine
  um: vector: Use mac_pton() for MAC address parsing
  um: vector: Clean up and modernize log messages
  um: chan_kern: use raw spinlock for irqs_to_free_lock
  MAINTAINERS: remove obsolete file entry in TUN/TAP DRIVER
  um: Fix tgkill compile error on old host OSes
  um: stop using PCI port I/O
  um: Remove legacy network transport infrastructure
  um: vector: Eliminate the dependency on uml_net
  um: Remove obsolete legacy network transports
  um/asm: Replace "REP; NOP" with PAUSE mnemonic
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>um: use proper care when taking mmap lock during segfault</title>
<updated>2025-05-05T08:24:58+00:00</updated>
<author>
<name>Benjamin Berg</name>
<email>benjamin.berg@intel.com</email>
</author>
<published>2025-04-08T07:45:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6767e8784cd2e8b386a62330ea6864949d983a3e'/>
<id>6767e8784cd2e8b386a62330ea6864949d983a3e</id>
<content type='text'>
Segfaults can occur at times where the mmap lock cannot be taken. If
that happens the segfault handler may not be able to take the mmap lock.

Fix the code to use the same approach as most other architectures.
Unfortunately, this requires copying code from mm/memory.c and modifying
it slightly as UML does not have exception tables.

Signed-off-by: Benjamin Berg &lt;benjamin.berg@intel.com&gt;
Link: https://patch.msgid.link/20250408074524.300153-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Segfaults can occur at times where the mmap lock cannot be taken. If
that happens the segfault handler may not be able to take the mmap lock.

Fix the code to use the same approach as most other architectures.
Unfortunately, this requires copying code from mm/memory.c and modifying
it slightly as UML does not have exception tables.

Signed-off-by: Benjamin Berg &lt;benjamin.berg@intel.com&gt;
Link: https://patch.msgid.link/20250408074524.300153-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>um: Remove duplicate arch.h header</title>
<updated>2025-05-05T08:24:37+00:00</updated>
<author>
<name>Jiapeng Chong</name>
<email>jiapeng.chong@linux.alibaba.com</email>
</author>
<published>2025-03-31T08:31:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=82c8e1280c74f80fbacb9efbba6dd7b23446e536'/>
<id>82c8e1280c74f80fbacb9efbba6dd7b23446e536</id>
<content type='text'>
./arch/um/kernel/trap.c: arch.h is included more than once.

Reported-by: Abaci Robot &lt;abaci@linux.alibaba.com&gt;
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19877
Signed-off-by: Jiapeng Chong &lt;jiapeng.chong@linux.alibaba.com&gt;
Link: https://patch.msgid.link/20250331083150.72598-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
./arch/um/kernel/trap.c: arch.h is included more than once.

Reported-by: Abaci Robot &lt;abaci@linux.alibaba.com&gt;
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19877
Signed-off-by: Jiapeng Chong &lt;jiapeng.chong@linux.alibaba.com&gt;
Link: https://patch.msgid.link/20250331083150.72598-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>um: fix _nofault accesses</title>
<updated>2025-05-05T08:06:51+00:00</updated>
<author>
<name>Johannes Berg</name>
<email>johannes.berg@intel.com</email>
</author>
<published>2025-04-04T15:05:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=68025adfc13e6cd15eebe2293f77659f47daf13b'/>
<id>68025adfc13e6cd15eebe2293f77659f47daf13b</id>
<content type='text'>
Nathan reported [1] that when built with clang, the um kernel
crashes pretty much immediately. This turned out to be an issue
with the inline assembly I had added, when clang used %rax/%eax
for both operands. Reorder it so current-&gt;thread.segv_continue
is written first, and then the lifetime of _faulted won't have
overlap with the lifetime of segv_continue.

In the email thread Benjamin also pointed out that current-&gt;mm
is only NULL for true kernel tasks, but we could do this for a
userspace task, so the current-&gt;thread.segv_continue logic must
be lifted out of the mm==NULL check.

Finally, while looking at this, put a barrier() so the NULL
assignment to thread.segv_continue cannot be reorder before
the possibly faulting operation.

Reported-by: Nathan Chancellor &lt;nathan@kernel.org&gt;
Closes: https://lore.kernel.org/r/20250402221254.GA384@ax162 [1]
Fixes: d1d7f01f7cd3 ("um: mark rodata read-only and implement _nofault accesses")
Tested-by: Nathan Chancellor &lt;nathan@kernel.org&gt;
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Nathan reported [1] that when built with clang, the um kernel
crashes pretty much immediately. This turned out to be an issue
with the inline assembly I had added, when clang used %rax/%eax
for both operands. Reorder it so current-&gt;thread.segv_continue
is written first, and then the lifetime of _faulted won't have
overlap with the lifetime of segv_continue.

In the email thread Benjamin also pointed out that current-&gt;mm
is only NULL for true kernel tasks, but we could do this for a
userspace task, so the current-&gt;thread.segv_continue logic must
be lifted out of the mm==NULL check.

Finally, while looking at this, put a barrier() so the NULL
assignment to thread.segv_continue cannot be reorder before
the possibly faulting operation.

Reported-by: Nathan Chancellor &lt;nathan@kernel.org&gt;
Closes: https://lore.kernel.org/r/20250402221254.GA384@ax162 [1]
Fixes: d1d7f01f7cd3 ("um: mark rodata read-only and implement _nofault accesses")
Tested-by: Nathan Chancellor &lt;nathan@kernel.org&gt;
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>um: mark rodata read-only and implement _nofault accesses</title>
<updated>2025-03-18T10:03:14+00:00</updated>
<author>
<name>Johannes Berg</name>
<email>johannes.berg@intel.com</email>
</author>
<published>2025-02-10T16:09:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d1d7f01f7cd35e16c6bcef5a0e31988b5c9980f9'/>
<id>d1d7f01f7cd35e16c6bcef5a0e31988b5c9980f9</id>
<content type='text'>
Mark read-only data actually read-only (simple mprotect), and
to be able to test it also implement _nofault accesses. This
works by setting up a new "segv_continue" pointer in current,
and then when we hit a segfault we change the signal return
context so that we continue at that address. The code using
this sets it up so that it jumps to a label and then aborts
the access that way, returning -EFAULT.

It's possible to optimize the ___backtrack_faulted() thing by
using asm goto (compiler version dependent) and/or gcc's (not
sure if clang has it) &amp;&amp;label extension, but at least in one
attempt I made the &amp;&amp; caused the compiler to not load -EFAULT
into the register in case of jumping to the &amp;&amp;label from the
fault handler. So leave it like this for now.

Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Co-developed-by: Benjamin Berg &lt;benjamin.berg@intel.com&gt;
Signed-off-by: Benjamin Berg &lt;benjamin.berg@intel.com&gt;
Link: https://patch.msgid.link/20250210160926.420133-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Mark read-only data actually read-only (simple mprotect), and
to be able to test it also implement _nofault accesses. This
works by setting up a new "segv_continue" pointer in current,
and then when we hit a segfault we change the signal return
context so that we continue at that address. The code using
this sets it up so that it jumps to a label and then aborts
the access that way, returning -EFAULT.

It's possible to optimize the ___backtrack_faulted() thing by
using asm goto (compiler version dependent) and/or gcc's (not
sure if clang has it) &amp;&amp;label extension, but at least in one
attempt I made the &amp;&amp; caused the compiler to not load -EFAULT
into the register in case of jumping to the &amp;&amp;label from the
fault handler. So leave it like this for now.

Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Co-developed-by: Benjamin Berg &lt;benjamin.berg@intel.com&gt;
Signed-off-by: Benjamin Berg &lt;benjamin.berg@intel.com&gt;
Link: https://patch.msgid.link/20250210160926.420133-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>um: remove fault_catcher infrastructure</title>
<updated>2024-10-23T07:52:46+00:00</updated>
<author>
<name>Johannes Berg</name>
<email>johannes.berg@intel.com</email>
</author>
<published>2024-10-10T20:45:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=188b64f288a434bed3ef21ec59f00c996ecb0346'/>
<id>188b64f288a434bed3ef21ec59f00c996ecb0346</id>
<content type='text'>
This was perhaps intended to do _nofault copies, but the
real reason is lost to history. Remove this, it's not
needed, and using longjmp() out of the middle of the
signal handler with all the state it has modified is
not going to be a good idea anyway.

Link: https://patch.msgid.link/20241010224513.901c4d390b3e.Ia74742668b44603c1ca23dd36f90e964e6e7ee55@changeid
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This was perhaps intended to do _nofault copies, but the
real reason is lost to history. Remove this, it's not
needed, and using longjmp() out of the middle of the
signal handler with all the state it has modified is
not going to be a good idea anyway.

Link: https://patch.msgid.link/20241010224513.901c4d390b3e.Ia74742668b44603c1ca23dd36f90e964e6e7ee55@changeid
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>um: refactor TLB update handling</title>
<updated>2024-07-03T15:09:50+00:00</updated>
<author>
<name>Benjamin Berg</name>
<email>benjamin.berg@intel.com</email>
</author>
<published>2024-07-03T13:45:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bcf3d957c63d8b6d718b862fea18c5f14ce803e2'/>
<id>bcf3d957c63d8b6d718b862fea18c5f14ce803e2</id>
<content type='text'>
Conceptually, we want the memory mappings to always be up to date and
represent whatever is in the TLB. To ensure that, we need to sync them
over in the userspace case and for the kernel we need to process the
mappings.

The kernel will call flush_tlb_* if page table entries that were valid
before become invalid. Unfortunately, this is not the case if entries
are added.

As such, change both flush_tlb_* and set_ptes to track the memory range
that has to be synchronized. For the kernel, we need to execute a
flush_tlb_kern_* immediately but we can wait for the first page fault in
case of set_ptes. For userspace in contrast we only store that a range
of memory needs to be synced and do so whenever we switch to that
process.

Signed-off-by: Benjamin Berg &lt;benjamin.berg@intel.com&gt;
Link: https://patch.msgid.link/20240703134536.1161108-13-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Conceptually, we want the memory mappings to always be up to date and
represent whatever is in the TLB. To ensure that, we need to sync them
over in the userspace case and for the kernel we need to process the
mappings.

The kernel will call flush_tlb_* if page table entries that were valid
before become invalid. Unfortunately, this is not the case if entries
are added.

As such, change both flush_tlb_* and set_ptes to track the memory range
that has to be synchronized. For the kernel, we need to execute a
flush_tlb_kern_* immediately but we can wait for the first page fault in
case of set_ptes. For userspace in contrast we only store that a range
of memory needs to be synced and do so whenever we switch to that
process.

Signed-off-by: Benjamin Berg &lt;benjamin.berg@intel.com&gt;
Link: https://patch.msgid.link/20240703134536.1161108-13-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: always expand the stack with the mmap write lock held</title>
<updated>2023-06-27T16:41:30+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-06-24T20:45:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8d7071af890768438c14db6172cc8f9f4d04e184'/>
<id>8d7071af890768438c14db6172cc8f9f4d04e184</id>
<content type='text'>
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.

For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks.  Let's see if any
strange users really wanted that.

It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma.  This makes it fairly
straightforward to convert the remaining architectures.

As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid.  So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.

Tested-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Tested-by: John Paul Adrian Glaubitz &lt;glaubitz@physik.fu-berlin.de&gt; # ia64
Tested-by: Frank Scheiner &lt;frank.scheiner@web.de&gt; # ia64
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.

For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks.  Let's see if any
strange users really wanted that.

It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma.  This makes it fairly
straightforward to convert the remaining architectures.

As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid.  So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.

Tested-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Tested-by: John Paul Adrian Glaubitz &lt;glaubitz@physik.fu-berlin.de&gt; # ia64
Tested-by: Frank Scheiner &lt;frank.scheiner@web.de&gt; # ia64
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: avoid unnecessary page fault retires on shared memory types</title>
<updated>2022-06-17T02:48:27+00:00</updated>
<author>
<name>Peter Xu</name>
<email>peterx@redhat.com</email>
</author>
<published>2022-05-30T18:34:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d92725256b4f22d084b813b37ddc394da79aacab'/>
<id>d92725256b4f22d084b813b37ddc394da79aacab</id>
<content type='text'>
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page.  It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock.  It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:

  Before: 650.980 ms (+-1.94%)
  After:  569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Acked-by: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Vineet Gupta &lt;vgupta@kernel.org&gt;
Acked-by: Guo Ren &lt;guoren@kernel.org&gt;
Acked-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Acked-by: Christian Borntraeger &lt;borntraeger@linux.ibm.com&gt;
Acked-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt; (powerpc)
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Reviewed-by: Alistair Popple &lt;apopple@nvidia.com&gt;
Reviewed-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Russell King (Oracle) &lt;rmk+kernel@armlinux.org.uk&gt;	[arm part]
Acked-by: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Cc: Stafford Horne &lt;shorne@gmail.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Johannes Berg &lt;johannes@sipsolutions.net&gt;
Cc: Brian Cain &lt;bcain@quicinc.com&gt;
Cc: Richard Henderson &lt;rth@twiddle.net&gt;
Cc: Richard Weinberger &lt;richard@nod.at&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Janosch Frank &lt;frankja@linux.ibm.com&gt;
Cc: Albert Ou &lt;aou@eecs.berkeley.edu&gt;
Cc: Anton Ivanov &lt;anton.ivanov@cambridgegreys.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: James Bottomley &lt;James.Bottomley@HansenPartnership.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Cc: Jonas Bonn &lt;jonas@southpole.se&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Michal Simek &lt;monstr@monstr.eu&gt;
Cc: Matt Turner &lt;mattst88@gmail.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
Cc: Stefan Kristiansson &lt;stefan.kristiansson@saunalahti.fi&gt;
Cc: Paul Walmsley &lt;paul.walmsley@sifive.com&gt;
Cc: Ivan Kokshaysky &lt;ink@jurassic.park.msu.ru&gt;
Cc: Chris Zankel &lt;chris@zankel.net&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Dinh Nguyen &lt;dinguyen@kernel.org&gt;
Cc: Rich Felker &lt;dalias@libc.org&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Thomas Bogendoerfer &lt;tsbogend@alpha.franken.de&gt;
Cc: Helge Deller &lt;deller@gmx.de&gt;
Cc: Yoshinori Sato &lt;ysato@users.osdn.me&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page.  It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock.  It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:

  Before: 650.980 ms (+-1.94%)
  After:  569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu &lt;peterx@redhat.com&gt;
Acked-by: Geert Uytterhoeven &lt;geert@linux-m68k.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Vineet Gupta &lt;vgupta@kernel.org&gt;
Acked-by: Guo Ren &lt;guoren@kernel.org&gt;
Acked-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
Acked-by: Christian Borntraeger &lt;borntraeger@linux.ibm.com&gt;
Acked-by: Michael Ellerman &lt;mpe@ellerman.id.au&gt; (powerpc)
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Reviewed-by: Alistair Popple &lt;apopple@nvidia.com&gt;
Reviewed-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Russell King (Oracle) &lt;rmk+kernel@armlinux.org.uk&gt;	[arm part]
Acked-by: Heiko Carstens &lt;hca@linux.ibm.com&gt;
Cc: Vasily Gorbik &lt;gor@linux.ibm.com&gt;
Cc: Stafford Horne &lt;shorne@gmail.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Johannes Berg &lt;johannes@sipsolutions.net&gt;
Cc: Brian Cain &lt;bcain@quicinc.com&gt;
Cc: Richard Henderson &lt;rth@twiddle.net&gt;
Cc: Richard Weinberger &lt;richard@nod.at&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Janosch Frank &lt;frankja@linux.ibm.com&gt;
Cc: Albert Ou &lt;aou@eecs.berkeley.edu&gt;
Cc: Anton Ivanov &lt;anton.ivanov@cambridgegreys.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Borislav Petkov &lt;bp@alien8.de&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: James Bottomley &lt;James.Bottomley@HansenPartnership.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Alexander Gordeev &lt;agordeev@linux.ibm.com&gt;
Cc: Jonas Bonn &lt;jonas@southpole.se&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Michal Simek &lt;monstr@monstr.eu&gt;
Cc: Matt Turner &lt;mattst88@gmail.com&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Nicholas Piggin &lt;npiggin@gmail.com&gt;
Cc: Palmer Dabbelt &lt;palmer@dabbelt.com&gt;
Cc: Stefan Kristiansson &lt;stefan.kristiansson@saunalahti.fi&gt;
Cc: Paul Walmsley &lt;paul.walmsley@sifive.com&gt;
Cc: Ivan Kokshaysky &lt;ink@jurassic.park.msu.ru&gt;
Cc: Chris Zankel &lt;chris@zankel.net&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Dinh Nguyen &lt;dinguyen@kernel.org&gt;
Cc: Rich Felker &lt;dalias@libc.org&gt;
Cc: H. Peter Anvin &lt;hpa@zytor.com&gt;
Cc: Andy Lutomirski &lt;luto@kernel.org&gt;
Cc: Thomas Bogendoerfer &lt;tsbogend@alpha.franken.de&gt;
Cc: Helge Deller &lt;deller@gmx.de&gt;
Cc: Yoshinori Sato &lt;ysato@users.osdn.me&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
