linux.git/arch/um/kernel/trap.c, branch v7.0

um: Add initial SMP support

2025-10-27T15:41:15+00:00

Add initial symmetric multi-processing (SMP) support to UML. With
this support enabled, users can tell UML to start multiple virtual
processors, each represented as a separate host thread.

In UML, kthreads and normal threads (when running in kernel mode)
can be scheduled and executed simultaneously on different virtual
processors. However, the userspace code of normal threads still
runs within their respective single-threaded stubs.

That is, SMP support is currently available both within the kernel
and across different processes, but still remains limited within
threads of the same process in userspace.

Signed-off-by: Tiwei Bie 
Link: https://patch.msgid.link/20251027001815.1666872-6-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg

Merge tag 'uml-for-linux-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux

2025-06-05T18:45:33+00:00

Pull UML updates from Johannes Berg:
 "The only really new thing is the long-standing seccomp work
  (originally from 2021!). Wven if it still isn't enabled by default due
  to security concerns it can still be used e.g. for tests.

   - remove obsolete network transports

   - remove PCI IO port support

   - start adding seccomp-based process handling instead of ptrace"

* tag 'uml-for-linux-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (29 commits)
  um: remove "extern" from implementation of sigchld_handler
  um: fix unused variable warning
  um: fix SECCOMP 32bit xstate register restore
  um: pass FD for memory operations when needed
  um: Add SECCOMP support detection and initialization
  um: Implement kernel side of SECCOMP based process handling
  um: Track userspace children dying in SECCOMP mode
  um: Add helper functions to get/set state for SECCOMP
  um: Add stub side of SECCOMP/futex based process handling
  um: Move faultinfo extraction into userspace routine
  um: vector: Use mac_pton() for MAC address parsing
  um: vector: Clean up and modernize log messages
  um: chan_kern: use raw spinlock for irqs_to_free_lock
  MAINTAINERS: remove obsolete file entry in TUN/TAP DRIVER
  um: Fix tgkill compile error on old host OSes
  um: stop using PCI port I/O
  um: Remove legacy network transport infrastructure
  um: vector: Eliminate the dependency on uml_net
  um: Remove obsolete legacy network transports
  um/asm: Replace "REP; NOP" with PAUSE mnemonic
  ...

um: use proper care when taking mmap lock during segfault

2025-05-05T08:24:58+00:00

Segfaults can occur at times where the mmap lock cannot be taken. If
that happens the segfault handler may not be able to take the mmap lock.

Fix the code to use the same approach as most other architectures.
Unfortunately, this requires copying code from mm/memory.c and modifying
it slightly as UML does not have exception tables.

Signed-off-by: Benjamin Berg 
Link: https://patch.msgid.link/20250408074524.300153-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg

um: Remove duplicate arch.h header

2025-05-05T08:24:37+00:00

./arch/um/kernel/trap.c: arch.h is included more than once.

Reported-by: Abaci Robot 
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19877
Signed-off-by: Jiapeng Chong 
Link: https://patch.msgid.link/20250331083150.72598-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Johannes Berg

um: fix _nofault accesses

2025-05-05T08:06:51+00:00

Nathan reported [1] that when built with clang, the um kernel
crashes pretty much immediately. This turned out to be an issue
with the inline assembly I had added, when clang used %rax/%eax
for both operands. Reorder it so current->thread.segv_continue
is written first, and then the lifetime of _faulted won't have
overlap with the lifetime of segv_continue.

In the email thread Benjamin also pointed out that current->mm
is only NULL for true kernel tasks, but we could do this for a
userspace task, so the current->thread.segv_continue logic must
be lifted out of the mm==NULL check.

Finally, while looking at this, put a barrier() so the NULL
assignment to thread.segv_continue cannot be reorder before
the possibly faulting operation.

Reported-by: Nathan Chancellor 
Closes: https://lore.kernel.org/r/20250402221254.GA384@ax162 [1]
Fixes: d1d7f01f7cd3 ("um: mark rodata read-only and implement _nofault accesses")
Tested-by: Nathan Chancellor 
Signed-off-by: Johannes Berg

um: mark rodata read-only and implement _nofault accesses

2025-03-18T10:03:14+00:00

Mark read-only data actually read-only (simple mprotect), and
to be able to test it also implement _nofault accesses. This
works by setting up a new "segv_continue" pointer in current,
and then when we hit a segfault we change the signal return
context so that we continue at that address. The code using
this sets it up so that it jumps to a label and then aborts
the access that way, returning -EFAULT.

It's possible to optimize the ___backtrack_faulted() thing by
using asm goto (compiler version dependent) and/or gcc's (not
sure if clang has it) &&label extension, but at least in one
attempt I made the && caused the compiler to not load -EFAULT
into the register in case of jumping to the &&label from the
fault handler. So leave it like this for now.

Signed-off-by: Johannes Berg 
Co-developed-by: Benjamin Berg 
Signed-off-by: Benjamin Berg 
Link: https://patch.msgid.link/20250210160926.420133-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg

um: remove fault_catcher infrastructure

2024-10-23T07:52:46+00:00

This was perhaps intended to do _nofault copies, but the
real reason is lost to history. Remove this, it's not
needed, and using longjmp() out of the middle of the
signal handler with all the state it has modified is
not going to be a good idea anyway.

Link: https://patch.msgid.link/20241010224513.901c4d390b3e.Ia74742668b44603c1ca23dd36f90e964e6e7ee55@changeid
Signed-off-by: Johannes Berg

um: refactor TLB update handling

2024-07-03T15:09:50+00:00

Conceptually, we want the memory mappings to always be up to date and
represent whatever is in the TLB. To ensure that, we need to sync them
over in the userspace case and for the kernel we need to process the
mappings.

The kernel will call flush_tlb_* if page table entries that were valid
before become invalid. Unfortunately, this is not the case if entries
are added.

As such, change both flush_tlb_* and set_ptes to track the memory range
that has to be synchronized. For the kernel, we need to execute a
flush_tlb_kern_* immediately but we can wait for the first page fault in
case of set_ptes. For userspace in contrast we only store that a range
of memory needs to be synced and do so whenever we switch to that
process.

Signed-off-by: Benjamin Berg 
Link: https://patch.msgid.link/20240703134536.1161108-13-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg

mm: always expand the stack with the mmap write lock held

2023-06-27T16:41:30+00:00

This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.

For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks.  Let's see if any
strange users really wanted that.

It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma.  This makes it fairly
straightforward to convert the remaining architectures.

As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid.  So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.

Tested-by: Vegard Nossum 
Tested-by: John Paul Adrian Glaubitz  # ia64
Tested-by: Frank Scheiner  # ia64
Signed-off-by: Linus Torvalds

mm: avoid unnecessary page fault retires on shared memory types

2022-06-17T02:48:27+00:00

I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page.  It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock.  It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:

  Before: 650.980 ms (+-1.94%)
  After:  569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu 
Acked-by: Geert Uytterhoeven 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Acked-by: Vineet Gupta 
Acked-by: Guo Ren 
Acked-by: Max Filippov 
Acked-by: Christian Borntraeger 
Acked-by: Michael Ellerman  (powerpc)
Acked-by: Catalin Marinas 
Reviewed-by: Alistair Popple 
Reviewed-by: Ingo Molnar 
Acked-by: Russell King (Oracle) 	[arm part]
Acked-by: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Stafford Horne 
Cc: David S. Miller 
Cc: Johannes Berg 
Cc: Brian Cain 
Cc: Richard Henderson 
Cc: Richard Weinberger 
Cc: Benjamin Herrenschmidt 
Cc: Thomas Gleixner 
Cc: Janosch Frank 
Cc: Albert Ou 
Cc: Anton Ivanov 
Cc: Dave Hansen 
Cc: Borislav Petkov 
Cc: Sven Schnelle 
Cc: Andrea Arcangeli 
Cc: James Bottomley 
Cc: Al Viro 
Cc: Alexander Gordeev 
Cc: Jonas Bonn 
Cc: Will Deacon 
Cc: Vlastimil Babka 
Cc: Michal Simek 
Cc: Matt Turner 
Cc: Paul Mackerras 
Cc: David Hildenbrand 
Cc: Nicholas Piggin 
Cc: Palmer Dabbelt 
Cc: Stefan Kristiansson 
Cc: Paul Walmsley 
Cc: Ivan Kokshaysky 
Cc: Chris Zankel 
Cc: Hugh Dickins 
Cc: Dinh Nguyen 
Cc: Rich Felker 
Cc: H. Peter Anvin 
Cc: Andy Lutomirski 
Cc: Thomas Bogendoerfer 
Cc: Helge Deller 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton