linux.git/arch/arm64/kernel/syscall.c, branch v7.1

randomize_kstack: Unify random source across arches

2026-03-25T04:12:03+00:00

Previously different architectures were using random sources of
differing strength and cost to decide the random kstack offset. A number
of architectures (loongarch, powerpc, s390, x86) were using their
timestamp counter, at whatever the frequency happened to be. Other
arches (arm64, riscv) were using entropy from the crng via
get_random_u16().

There have been concerns that in some cases the timestamp counters may
be too weak, because they can be easily guessed or influenced by user
space. And get_random_u16() has been shown to be too costly for the
level of protection kstack offset randomization provides.

So let's use a common, architecture-agnostic source of entropy; a
per-cpu prng, seeded at boot-time from the crng. This has a few
benefits:

  - We can remove choose_random_kstack_offset(); That was only there to
    try to make the timestamp counter value a bit harder to influence
    from user space [*].

  - The architecture code is simplified. All it has to do now is call
    add_random_kstack_offset() in the syscall path.

  - The strength of the randomness can be reasoned about independently
    of the architecture.

  - Arches previously using get_random_u16() now have much faster
    syscall paths, see below results.

[*] Additionally, this gets rid of some redundant work on s390 and x86.
Before this patch, those architectures called
choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(),
which is also called for exception returns to userspace which were *not*
syscalls (e.g. regular interrupts). Getting rid of
choose_random_kstack_offset() avoids a small amount of redundant work
for the non-syscall cases.

In some configurations, add_random_kstack_offset() will now call
instrumentable code, so for a couple of arches, I have moved the call a
bit later to the first point where instrumentation is allowed. This
doesn't impact the efficacy of the mechanism.

There have been some claims that a prng may be less strong than the
timestamp counter if not regularly reseeded. But the prng has a period
of about 2^113. So as long as the prng state remains secret, it should
not be possible to guess. If the prng state can be accessed, we have
bigger problems.

Additionally, we are only consuming 6 bits to randomize the stack, so
there are only 64 possible random offsets. I assert that it would be
trivial for an attacker to brute force by repeating their attack and
waiting for the random stack offset to be the desired one. The prng
approach seems entirely proportional to this level of protection.

Performance data are provided below. The baseline is v6.18 with rndstack
on for each respective arch. (I)/(R) indicate statistically significant
improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal).
x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):

+-----------------+--------------+---------------+---------------+
| Benchmark       | Result Class |  per-cpu-prng |  per-cpu-prng |
|                 |              | arm64 (metal) |   x86_64 (VM) |
+=================+==============+===============+===============+
| syscall/getpid  | mean (ns)    |    (I) -9.50% |   (I) -17.65% |
|                 | p99 (ns)     |   (I) -59.24% |   (I) -24.41% |
|                 | p99.9 (ns)   |   (I) -59.52% |   (I) -28.52% |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns)    |    (I) -9.52% |   (I) -19.24% |
|                 | p99 (ns)     |   (I) -59.25% |   (I) -25.03% |
|                 | p99.9 (ns)   |   (I) -59.50% |   (I) -28.17% |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns)    |   (I) -10.31% |   (I) -18.56% |
|                 | p99 (ns)     |   (I) -60.79% |   (I) -20.06% |
|                 | p99.9 (ns)   |   (I) -61.04% |   (I) -25.04% |
+-----------------+--------------+---------------+---------------+

I tested an earlier version of this change on x86 bare metal and it
showed a smaller but still significant improvement. The bare metal
system wasn't available this time around so testing was done in a VM
instance. I'm guessing the cost of rdtsc is higher for VMs.

Acked-by: Mark Rutland 
Signed-off-by: Ryan Roberts 
Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com
Signed-off-by: Kees Cook

arm64: add unlikely hint to MTE async fault check in el0_svc_common

2025-11-11T19:49:19+00:00

Add unlikely() hint to the _TIF_MTE_ASYNC_FAULT flag check in
el0_svc_common() since asynchronous MTE faults are expected to be
rare occurrences during normal system call execution.

This optimization helps the compiler to improve instruction caching
and branch prediction for the common case where no asynchronous
MTE faults are pending, while maintaining correct behavior for
the exceptional case where such faults need to be handled prior
to system call execution.

Signed-off-by: Li Qiang 
Signed-off-by: Catalin Marinas

arm/syscalls: mark syscall invocation as likely in invoke_syscall

2025-09-22T12:26:16+00:00

The invoke_syscall() function is overwhelmingly called for
valid system call entries. Annotate the main path with likely()
to help the compiler generate better branch prediction hints,
reducing CPU pipeline stalls due to mispredictions.

This is a micro-optimization targeting syscall-heavy workloads [1].

Link: https://lore.kernel.org/r/20250922121730.986761-1-pengcan@kylinos.cn [1]
Signed-off-by: Can Peng 
Signed-off-by: Will Deacon

arm64: convert unistd_32.h to syscall.tbl format

2024-07-10T12:23:38+00:00

This is a straight conversion from the old asm/unistd32.h into the
format used by 32-bit arm and most other architectures, calling scripts
to generate the asm/unistd32.h header and a new asm/syscalls32.h headers.

I used a semi-automated text replacement method to do the conversion,
and then used 'vimdiff' to synchronize the whitespace and the (unused)
names of the non-compat syscalls with the arm version.

There are two differences between the generated syscalls names and the
old version:

 - the old asm/unistd32.h contained only a __NR_sync_file_range2
   entry, while the arm32 version also defines
   __NR_arm_sync_file_range with the same number. I added this
   duplicate back in asm/unistd32.h.

 - __NR__sysctl was removed from the arm64 file a while ago, but
   all the tables still contain it. This should probably get removed
   everywhere but I added it here for consistency.

On top of that, the arm64 version does not contain any references to
the 32-bit OABI syscalls that are not supported by arm64. If we ever
want to share the file between arm32 and arm64, it would not be
hard to add support for both in one file.

Acked-by: Catalin Marinas 
Signed-off-by: Arnd Bergmann

randomize_kstack: Remove non-functional per-arch entropy filtering

2024-06-28T15:54:56+00:00

An unintended consequence of commit 9c573cd31343 ("randomize_kstack:
Improve entropy diffusion") was that the per-architecture entropy size
filtering reduced how many bits were being added to the mix, rather than
how many bits were being used during the offsetting. All architectures
fell back to the existing default of 0x3FF (10 bits), which will consume
at most 1KiB of stack space. It seems that this is working just fine,
so let's avoid the confusion and update everything to use the default.

The prior intent of the per-architecture limits were:

  arm64: capped at 0x1FF (9 bits), 5 bits effective
  powerpc: uncapped (10 bits), 6 or 7 bits effective
  riscv: uncapped (10 bits), 6 bits effective
  x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
  s390: capped at 0xFF (8 bits), undocumented effective entropy

Current discussion has led to just dropping the original per-architecture
filters. The additional entropy appears to be safe for arm64, x86,
and s390. Quoting Arnd, "There is no point pretending that 15.75KB is
somehow safe to use while 15.00KB is not."

Co-developed-by: Yuntao Liu 
Signed-off-by: Yuntao Liu 
Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion")
Link: https://lore.kernel.org/r/20240617133721.377540-1-liuyuntao12@huawei.com
Reviewed-by: Arnd Bergmann 
Acked-by: Mark Rutland 
Acked-by: Heiko Carstens  # s390
Link: https://lore.kernel.org/r/20240619214711.work.953-kees@kernel.org
Signed-off-by: Kees Cook

arm64: remove unnecessary ifdefs around is_compat_task()

2024-02-28T18:01:23+00:00

Currently some parts of the codebase will test for CONFIG_COMPAT before
testing is_compat_task().

is_compat_task() is a inlined function only present on CONFIG_COMPAT.
On the other hand, for !CONFIG_COMPAT, we have in linux/compat.h:

 #define is_compat_task() (0)

Since we have this define available in every usage of is_compat_task() for
!CONFIG_COMPAT, it's unnecessary to keep the ifdefs, since the compiler is
smart enough to optimize-out those snippets on CONFIG_COMPAT=n

This requires some regset code as well as a few other defines to be made
available on !CONFIG_COMPAT, so some symbols can get resolved before
getting optimized-out.

Signed-off-by: Leonardo Bras 
Reviewed-by: Arnd Bergmann 
Link: https://lore.kernel.org/r/20240109034651.478462-2-leobras@redhat.com
Signed-off-by: Catalin Marinas

arm64: syscall: unmask DAIF earlier for SVCs

2023-08-11T11:23:48+00:00

For a number of historical reasons, when handling SVCs we don't unmask
DAIF in el0_svc() or el0_svc_compat(), and instead do so later in
el0_svc_common(). This is unfortunate and makes it harder to make
changes to the DAIF management in entry-common.c as we'd like to do as
cleanup and preparation for FEAT_NMI support. We can move the DAIF
unmasking to entry-common.c as long as we also hoist the
fp_user_discard() logic, as reasoned below.

We converted the syscall trace logic from assembly to C in commit:

  f37099b6992a0b81 ("arm64: convert syscall trace logic to C")

... which was intended to have no functional change, and mirrored the
existing assembly logic to avoid the risk of any functional regression.

With the logic in C, it's clear that there is currently no reason to
unmask DAIF so late within el0_svc_common():

* The thread flags are read prior to unmasking DAIF, but are not
  consumed until after DAIF is unmasked, and we don't perform a
  read-modify-write sequence of the thread flags for which we might need
  to serialize against an IPI modifying the flags. Similarly, for any
  thread flags set by other threads, whether DAIF is masked or not has
  no impact.

  The read_thread_flags() helpers performs a single-copy-atomic read of
  the flags, and so this can safely be moved after unmasking DAIF.

* The pt_regs::orig_x0 and pt_regs::syscallno fields are neither
  consumed nor modified by the handler for any DAIF exception (e.g.
  these do not exist in the `perf_event_arm_regs` enum and are not
  sampled by perf in its IRQ handler).

  Thus, the manipulation of pt_regs::orig_x0 and pt_regs::syscallno can
  safely be moved after unmasking DAIF.

Given the above, we can safely hoist unmasking of DAIF out of
el0_svc_common(), and into its immediate callers: do_el0_svc() and
do_el0_svc_compat(). Further:

* In do_el0_svc(), we sample the syscall number from
  pt_regs::regs[8]. This is not modified by the handler for any DAIF
  exception, and thus can safely be moved after unmasking DAIF.

  As fp_user_discard() operates on the live FP/SVE/SME register state,
  this needs to occur before we clear DAIF.IF, as interrupts could
  result in preemption which would cause this state to become foreign.
  As fp_user_discard() is the first function called within do_el0_svc(),
  it has no dependency on other parts of do_el0_svc() and can be moved
  earlier so long as it is called prior to unmasking DAIF.IF.

* In do_el0_svc_compat(), we sample the syscall number from
  pt_regs::regs[7]. This is not modified by the handler for any DAIF
  exception, and thus can safely be moved after unmasking DAIF.

  Compat threads cannot use SVE or SME, so there's no need for
  el0_svc_compat() to call fp_user_discard().

Given the above, we can safely hoist the unmasking of DAIF out of
do_el0_svc() and do_el0_svc_compat(), and into their immediate callers:
el0_svc() and el0_svc_compat(), so long a we also hoist
fp_user_discard() into el0_svc().

Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Marc Zyngier 
Cc: Mark Brown 
Cc: Will Deacon 
Reviewed-by: Mark Brown 
Link: https://lore.kernel.org/r/20230808101148.1064172-1-mark.rutland@arm.com
Signed-off-by: Will Deacon

tracing: arm64: Avoid missing-prototype warnings

2023-07-12T16:06:04+00:00

These are all tracing W=1 warnings in arm64 allmodconfig about missing
prototypes:

kernel/trace/trace_kprobe_selftest.c:7:5: error: no previous prototype for 'kprobe_trace_selftest_target' [-Werror=missing-pro
totypes]
kernel/trace/ftrace.c:329:5: error: no previous prototype for '__register_ftrace_function' [-Werror=missing-prototypes]
kernel/trace/ftrace.c:372:5: error: no previous prototype for '__unregister_ftrace_function' [-Werror=missing-prototypes]
kernel/trace/ftrace.c:4130:15: error: no previous prototype for 'arch_ftrace_match_adjust' [-Werror=missing-prototypes]
kernel/trace/fgraph.c:243:15: error: no previous prototype for 'ftrace_return_to_handler' [-Werror=missing-prototypes]
kernel/trace/fgraph.c:358:6: error: no previous prototype for 'ftrace_graph_sleep_time_control' [-Werror=missing-prototypes]
arch/arm64/kernel/ftrace.c:460:6: error: no previous prototype for 'prepare_ftrace_return' [-Werror=missing-prototypes]
arch/arm64/kernel/ptrace.c:2172:5: error: no previous prototype for 'syscall_trace_enter' [-Werror=missing-prototypes]
arch/arm64/kernel/ptrace.c:2195:6: error: no previous prototype for 'syscall_trace_exit' [-Werror=missing-prototypes]

Move the declarations to an appropriate header where they can be seen
by the caller and callee, and make sure the headers are included where
needed.

Link: https://lore.kernel.org/linux-trace-kernel/20230517125215.930689-1-arnd@kernel.org

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Will Deacon 
Cc: Kees Cook 
Cc: Florent Revest 
Signed-off-by: Arnd Bergmann 
Acked-by: Catalin Marinas 
[ Fixed ftrace_return_to_handler() to handle CONFIG_HAVE_FUNCTION_GRAPH_RETVAL case ]
Signed-off-by: Steven Rostedt (Google)

arm64: syscall: unmask DAIF for tracing status

2023-06-07T17:23:22+00:00

The following code:
static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
                            const syscall_fn_t syscall_table[])
{
    ...

    if (!has_syscall_work(flags) && !IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
        local_daif_mask();
        flags = read_thread_flags(); -------------------------------- A
        if (!has_syscall_work(flags) && !(flags & _TIF_SINGLESTEP))
             return;
        local_daif_restore(DAIF_PROCCTX);
     }

trace_exit:
     syscall_trace_exit(regs); ------- B
}

1. The flags in the if conditional statement should be used
in the same way as the flags in the function syscall_trace_exit,
because DAIF is not shielded in the function syscall_trace_exit,
so when the flags are obtained in line A of the code,
there is no need to shield DAIF.
Don't care about the modification of flags after line A.

2. Masking DAIF caused syscall performance to deteriorate by about 10%.
The Unixbench single core syscall performance data is as follows:

Machine: Kunpeng 920

Mask DAIF:
System Call Overhead                    1172314.1 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index          BASELINE       RESULT    INDEX
System Call Overhead                      15000.0    1172314.1    781.5
                                                               ========
System Benchmarks Index Score (Partial Only)                      781.5

Unmask DAIF:
System Call Overhead                    1287944.6 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index          BASELINE       RESULT    INDEX
System Call Overhead                      15000.0    1287944.6    858.6
                                                               ========
System Benchmarks Index Score (Partial Only)                      858.6

Rationale from Mark Rutland as for why this is safe:

This masking is an artifact of the old "ret_fast_syscall" assembly that
was converted to C in commit:

  f37099b6992a0b81 ("arm64: convert syscall trace logic to C")

The assembly would mask DAIF, check the thread flags, and fall through
to kernel_exit without unmasking if no tracing was needed. The
conversion copied this masking into the C version, though this wasn't
strictly necessary.

As (in general) thread flags can be manipulated by other threads, it's
not safe to manipulate the thread flags with plain reads and writes, and
since commit:

  342b3808786518ce ("arm64: Snapshot thread flags")

... we use read_thread_flags() to read the flags atomically.

With this, there is no need to mask DAIF transiently around reading the
flags, as we only decide whether to trace while DAIF is masked, and the
actual tracing occurs with DAIF unmasked. When el0_svc_common() returns
its caller will unconditionally mask DAIF via exit_to_user_mode(), so
the masking is redundant.

Signed-off-by: Guo Hui 
Reviewed-by: Mark Rutland 
Link: https://lore.kernel.org/r/20230526024715.8773-1-guohui@uniontech.com
[catalin.marinas@arm.com: updated comment with Mark's rationale]
Signed-off-by: Catalin Marinas

arm64/sme: Optimise SME exit on syscall entry

2023-01-12T17:09:21+00:00

Our ABI says that we exit streaming mode on syscall entry. Currently we
check if we are in streaming mode before doing this but since we have a
SMSTOP SM instruction which will clear SVCR.SM in a single atomic operation
we can save ourselves the read of the system register and check of the flag
and just unconditionally do the SMSTOP SM. If we are not in streaming mode
it results in a noop change to SVCR, if we are in streaming mode we will
exit as desired.

No functional change.

Signed-off-by: Mark Brown 
Link: https://lore.kernel.org/r/20230110-arm64-sme-syscall-smstop-v1-1-ac94235fd810@kernel.org
Signed-off-by: Catalin Marinas