linux.git/arch/arm64/kernel/syscall.c, branch v6.6

arm64: syscall: unmask DAIF earlier for SVCs

2023-08-11T11:23:48+00:00

For a number of historical reasons, when handling SVCs we don't unmask
DAIF in el0_svc() or el0_svc_compat(), and instead do so later in
el0_svc_common(). This is unfortunate and makes it harder to make
changes to the DAIF management in entry-common.c as we'd like to do as
cleanup and preparation for FEAT_NMI support. We can move the DAIF
unmasking to entry-common.c as long as we also hoist the
fp_user_discard() logic, as reasoned below.

We converted the syscall trace logic from assembly to C in commit:

  f37099b6992a0b81 ("arm64: convert syscall trace logic to C")

... which was intended to have no functional change, and mirrored the
existing assembly logic to avoid the risk of any functional regression.

With the logic in C, it's clear that there is currently no reason to
unmask DAIF so late within el0_svc_common():

* The thread flags are read prior to unmasking DAIF, but are not
  consumed until after DAIF is unmasked, and we don't perform a
  read-modify-write sequence of the thread flags for which we might need
  to serialize against an IPI modifying the flags. Similarly, for any
  thread flags set by other threads, whether DAIF is masked or not has
  no impact.

  The read_thread_flags() helpers performs a single-copy-atomic read of
  the flags, and so this can safely be moved after unmasking DAIF.

* The pt_regs::orig_x0 and pt_regs::syscallno fields are neither
  consumed nor modified by the handler for any DAIF exception (e.g.
  these do not exist in the `perf_event_arm_regs` enum and are not
  sampled by perf in its IRQ handler).

  Thus, the manipulation of pt_regs::orig_x0 and pt_regs::syscallno can
  safely be moved after unmasking DAIF.

Given the above, we can safely hoist unmasking of DAIF out of
el0_svc_common(), and into its immediate callers: do_el0_svc() and
do_el0_svc_compat(). Further:

* In do_el0_svc(), we sample the syscall number from
  pt_regs::regs[8]. This is not modified by the handler for any DAIF
  exception, and thus can safely be moved after unmasking DAIF.

  As fp_user_discard() operates on the live FP/SVE/SME register state,
  this needs to occur before we clear DAIF.IF, as interrupts could
  result in preemption which would cause this state to become foreign.
  As fp_user_discard() is the first function called within do_el0_svc(),
  it has no dependency on other parts of do_el0_svc() and can be moved
  earlier so long as it is called prior to unmasking DAIF.IF.

* In do_el0_svc_compat(), we sample the syscall number from
  pt_regs::regs[7]. This is not modified by the handler for any DAIF
  exception, and thus can safely be moved after unmasking DAIF.

  Compat threads cannot use SVE or SME, so there's no need for
  el0_svc_compat() to call fp_user_discard().

Given the above, we can safely hoist the unmasking of DAIF out of
do_el0_svc() and do_el0_svc_compat(), and into their immediate callers:
el0_svc() and el0_svc_compat(), so long a we also hoist
fp_user_discard() into el0_svc().

Signed-off-by: Mark Rutland 
Cc: Catalin Marinas 
Cc: Marc Zyngier 
Cc: Mark Brown 
Cc: Will Deacon 
Reviewed-by: Mark Brown 
Link: https://lore.kernel.org/r/20230808101148.1064172-1-mark.rutland@arm.com
Signed-off-by: Will Deacon

tracing: arm64: Avoid missing-prototype warnings

2023-07-12T16:06:04+00:00

These are all tracing W=1 warnings in arm64 allmodconfig about missing
prototypes:

kernel/trace/trace_kprobe_selftest.c:7:5: error: no previous prototype for 'kprobe_trace_selftest_target' [-Werror=missing-pro
totypes]
kernel/trace/ftrace.c:329:5: error: no previous prototype for '__register_ftrace_function' [-Werror=missing-prototypes]
kernel/trace/ftrace.c:372:5: error: no previous prototype for '__unregister_ftrace_function' [-Werror=missing-prototypes]
kernel/trace/ftrace.c:4130:15: error: no previous prototype for 'arch_ftrace_match_adjust' [-Werror=missing-prototypes]
kernel/trace/fgraph.c:243:15: error: no previous prototype for 'ftrace_return_to_handler' [-Werror=missing-prototypes]
kernel/trace/fgraph.c:358:6: error: no previous prototype for 'ftrace_graph_sleep_time_control' [-Werror=missing-prototypes]
arch/arm64/kernel/ftrace.c:460:6: error: no previous prototype for 'prepare_ftrace_return' [-Werror=missing-prototypes]
arch/arm64/kernel/ptrace.c:2172:5: error: no previous prototype for 'syscall_trace_enter' [-Werror=missing-prototypes]
arch/arm64/kernel/ptrace.c:2195:6: error: no previous prototype for 'syscall_trace_exit' [-Werror=missing-prototypes]

Move the declarations to an appropriate header where they can be seen
by the caller and callee, and make sure the headers are included where
needed.

Link: https://lore.kernel.org/linux-trace-kernel/20230517125215.930689-1-arnd@kernel.org

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Will Deacon 
Cc: Kees Cook 
Cc: Florent Revest 
Signed-off-by: Arnd Bergmann 
Acked-by: Catalin Marinas 
[ Fixed ftrace_return_to_handler() to handle CONFIG_HAVE_FUNCTION_GRAPH_RETVAL case ]
Signed-off-by: Steven Rostedt (Google)

arm64: syscall: unmask DAIF for tracing status

2023-06-07T17:23:22+00:00

The following code:
static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
                            const syscall_fn_t syscall_table[])
{
    ...

    if (!has_syscall_work(flags) && !IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
        local_daif_mask();
        flags = read_thread_flags(); -------------------------------- A
        if (!has_syscall_work(flags) && !(flags & _TIF_SINGLESTEP))
             return;
        local_daif_restore(DAIF_PROCCTX);
     }

trace_exit:
     syscall_trace_exit(regs); ------- B
}

1. The flags in the if conditional statement should be used
in the same way as the flags in the function syscall_trace_exit,
because DAIF is not shielded in the function syscall_trace_exit,
so when the flags are obtained in line A of the code,
there is no need to shield DAIF.
Don't care about the modification of flags after line A.

2. Masking DAIF caused syscall performance to deteriorate by about 10%.
The Unixbench single core syscall performance data is as follows:

Machine: Kunpeng 920

Mask DAIF:
System Call Overhead                    1172314.1 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index          BASELINE       RESULT    INDEX
System Call Overhead                      15000.0    1172314.1    781.5
                                                               ========
System Benchmarks Index Score (Partial Only)                      781.5

Unmask DAIF:
System Call Overhead                    1287944.6 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index          BASELINE       RESULT    INDEX
System Call Overhead                      15000.0    1287944.6    858.6
                                                               ========
System Benchmarks Index Score (Partial Only)                      858.6

Rationale from Mark Rutland as for why this is safe:

This masking is an artifact of the old "ret_fast_syscall" assembly that
was converted to C in commit:

  f37099b6992a0b81 ("arm64: convert syscall trace logic to C")

The assembly would mask DAIF, check the thread flags, and fall through
to kernel_exit without unmasking if no tracing was needed. The
conversion copied this masking into the C version, though this wasn't
strictly necessary.

As (in general) thread flags can be manipulated by other threads, it's
not safe to manipulate the thread flags with plain reads and writes, and
since commit:

  342b3808786518ce ("arm64: Snapshot thread flags")

... we use read_thread_flags() to read the flags atomically.

With this, there is no need to mask DAIF transiently around reading the
flags, as we only decide whether to trace while DAIF is masked, and the
actual tracing occurs with DAIF unmasked. When el0_svc_common() returns
its caller will unconditionally mask DAIF via exit_to_user_mode(), so
the masking is redundant.

Signed-off-by: Guo Hui 
Reviewed-by: Mark Rutland 
Link: https://lore.kernel.org/r/20230526024715.8773-1-guohui@uniontech.com
[catalin.marinas@arm.com: updated comment with Mark's rationale]
Signed-off-by: Catalin Marinas

arm64/sme: Optimise SME exit on syscall entry

2023-01-12T17:09:21+00:00

Our ABI says that we exit streaming mode on syscall entry. Currently we
check if we are in streaming mode before doing this but since we have a
SMSTOP SM instruction which will clear SVCR.SM in a single atomic operation
we can save ourselves the read of the system register and check of the flag
and just unconditionally do the SMSTOP SM. If we are not in streaming mode
it results in a noop change to SVCR, if we are in streaming mode we will
exit as desired.

No functional change.

Signed-off-by: Mark Brown 
Link: https://lore.kernel.org/r/20230110-arm64-sme-syscall-smstop-v1-1-ac94235fd810@kernel.org
Signed-off-by: Catalin Marinas

arm64/sve: Leave SVE enabled on syscall if we don't context switch

2022-11-29T15:01:56+00:00

The syscall ABI says that the SVE register state not shared with FPSIMD
may not be preserved on syscall, and this is the only mechanism we have
in the ABI to stop tracking the extra SVE state for a process. Currently
we do this unconditionally by means of disabling SVE for the process on
syscall, causing userspace to take a trap to EL1 if it uses SVE again.
These extra traps result in a noticeable overhead for using SVE instead
of FPSIMD in some workloads, especially for simple syscalls where we can
return directly to userspace and would not otherwise need to update the
floating point registers. Tests with fp-pidbench show an approximately
70% overhead on a range of implementations when SVE is in use - while
this is an extreme and entirely artificial benchmark it is clear that
there is some useful room for improvement here.

Now that we have the ability to track the decision about what to save
seprately to TIF_SVE we can improve things by leaving TIF_SVE enabled on
syscall but only saving the FPSIMD registers if we are in a syscall.
This means that if we need to restore the register state from memory
(eg, after a context switch or kernel mode NEON) we will drop TIF_SVE
and reenable traps for userspace but if we can just return to userspace
then traps will remain disabled.

Since our current implementation and hence ABI has the effect of zeroing
all the SVE register state not shared with FPSIMD on syscall we replace
the disabling of TIF_SVE with a flush of the non-shared register state,
this means that there is still some overhead for syscalls when SVE is in
use but it is very much reduced.

Signed-off-by: Mark Brown 
Reviewed-by: Catalin Marinas 
Reviewed-by: Marc Zyngier 
Link: https://lore.kernel.org/r/20221115094640.112848-8-broonie@kernel.org
Signed-off-by: Will Deacon

treewide: use get_random_{u8,u16}() when possible, part 1

2022-10-11T23:42:58+00:00

Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value,
simply use the get_random_{u8,u16}() functions, which are faster than
wasting the additional bytes from a 32-bit value. This was done
mechanically with this coccinelle script:

@@
expression E;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u16;
typedef __be16;
typedef __le16;
typedef u8;
@@
(
- (get_random_u32() & 0xffff)
+ get_random_u16()
|
- (get_random_u32() & 0xff)
+ get_random_u8()
|
- (get_random_u32() % 65536)
+ get_random_u16()
|
- (get_random_u32() % 256)
+ get_random_u8()
|
- (get_random_u32() >> 16)
+ get_random_u16()
|
- (get_random_u32() >> 24)
+ get_random_u8()
|
- (u16)get_random_u32()
+ get_random_u16()
|
- (u8)get_random_u32()
+ get_random_u8()
|
- (__be16)get_random_u32()
+ (__be16)get_random_u16()
|
- (__le16)get_random_u32()
+ (__le16)get_random_u16()
|
- prandom_u32_max(65536)
+ get_random_u16()
|
- prandom_u32_max(256)
+ get_random_u8()
|
- E->inet_id = get_random_u32()
+ E->inet_id = get_random_u16()
)

@@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u16;
identifier v;
@@
- u16 v = get_random_u32();
+ u16 v = get_random_u16();

@@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u8;
identifier v;
@@
- u8 v = get_random_u32();
+ u8 v = get_random_u8();

@@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u16;
u16 v;
@@
-  v = get_random_u32();
+  v = get_random_u16();

@@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u8;
u8 v;
@@
-  v = get_random_u32();
+  v = get_random_u8();

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

        ((T)get_random_u32()@p & (LITERAL))

// Examine limits
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
        value = int(literal, 16)
elif literal[0] in '123456789':
        value = int(literal, 10)
if value is None:
        print("I don't know how to handle %s" % (literal))
        cocci.include_match(False)
elif value < 256:
        coccinelle.RESULT = cocci.make_ident("get_random_u8")
elif value < 65536:
        coccinelle.RESULT = cocci.make_ident("get_random_u16")
else:
        print("Skipping large mask of %s" % (literal))
        cocci.include_match(False)

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
identifier add_one.RESULT;
identifier FUNC;
@@

-       (FUNC()@p & (LITERAL))
+       (RESULT() & LITERAL)

Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Kees Cook 
Reviewed-by: Yury Norov 
Acked-by: Jakub Kicinski 
Acked-by: Toke Høiland-Jørgensen  # for sch_cake
Signed-off-by: Jason A. Donenfeld

arm64/sme: Remove _EL0 from name of SVCR - FIXME sysreg.h

2022-05-16T18:50:20+00:00

The defines for SVCR call it SVCR_EL0 however the architecture calls the
register SVCR with no _EL0 suffix. In preparation for generating the sysreg
definitions rename to match the architecture, no functional change.

Signed-off-by: Mark Brown 
Link: https://lore.kernel.org/r/20220510161208.631259-6-broonie@kernel.org
Signed-off-by: Catalin Marinas

arm64/sme: Implement traps and syscall handling for SME

2022-04-22T17:51:05+00:00

By default all SME operations in userspace will trap.  When this happens
we allocate storage space for the SME register state, set up the SVE
registers and disable traps.  We do not need to initialize ZA since the
architecture guarantees that it will be zeroed when enabled and when we
trap ZA is disabled.

On syscall we exit streaming mode if we were previously in it and ensure
that all but the lower 128 bits of the registers are zeroed while
preserving the state of ZA. This follows the aarch64 PCS for SME, ZA
state is preserved over a function call and streaming mode is exited.
Since the traps for SME do not distinguish between streaming mode SVE
and ZA usage if ZA is in use rather than reenabling traps we instead
zero the parts of the SVE registers not shared with FPSIMD and leave SME
enabled, this simplifies handling SME traps. If ZA is not in use then we
reenable SME traps and fall through to normal handling of SVE.

Signed-off-by: Mark Brown 
Reviewed-by: Catalin Marinas 
Link: https://lore.kernel.org/r/20220419112247.711548-17-broonie@kernel.org
Signed-off-by: Catalin Marinas

arm64: Snapshot thread flags

2021-11-30T23:06:44+00:00

Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Generally this is unlikely to cause a
problem in practice, but it is somewhat unsound, and KCSAN will
legitimately warn that there is a data race.

To avoid such issues, a snapshot of the flags has to be taken prior to
using them. Some places already use READ_ONCE() for that, others do not.

Convert them all to the new flag accessor helpers.

Signed-off-by: Mark Rutland 
Signed-off-by: Thomas Gleixner 
Acked-by: Will Deacon 
Acked-by: Paul E. McKenney 
Cc: Catalin Marinas 
Link: https://lore.kernel.org/r/20211129130653.2037928-7-mark.rutland@arm.com

arm64: fix compat syscall return truncation

2021-08-03T09:35:03+00:00

Due to inconsistencies in the way we manipulate compat GPRs, we have a
few issues today:

* For audit and tracing, where error codes are handled as a (native)
  long, negative error codes are expected to be sign-extended to the
  native 64-bits, or they may fail to be matched correctly. Thus a
  syscall which fails with an error may erroneously be identified as
  failing.

* For ptrace, *all* compat return values should be sign-extended for
  consistency with 32-bit arm, but we currently only do this for
  negative return codes.

* As we may transiently set the upper 32 bits of some compat GPRs while
  in the kernel, these can be sampled by perf, which is somewhat
  confusing. This means that where a syscall returns a pointer above 2G,
  this will be sign-extended, but will not be mistaken for an error as
  error codes are constrained to the inclusive range [-4096, -1] where
  no user pointer can exist.

To fix all of these, we must consistently use helpers to get/set the
compat GPRs, ensuring that we never write the upper 32 bits of the
return code, and always sign-extend when reading the return code.  This
patch does so, with the following changes:

* We re-organise syscall_get_return_value() to always sign-extend for
  compat tasks, and reimplement syscall_get_error() atop. We update
  syscall_trace_exit() to use syscall_get_return_value().

* We consistently use syscall_set_return_value() to set the return
  value, ensureing the upper 32 bits are never set unexpectedly.

* As the core audit code currently uses regs_return_value() rather than
  syscall_get_return_value(), we special-case this for
  compat_user_mode(regs) such that this will do the right thing. Going
  forward, we should try to move the core audit code over to
  syscall_get_return_value().

Cc: 
Reported-by: He Zhe 
Reported-by: weiyuchen 
Cc: Catalin Marinas 
Cc: Will Deacon 
Reviewed-by: Catalin Marinas 
Link: https://lore.kernel.org/r/20210802104200.21390-1-mark.rutland@arm.com
Signed-off-by: Will Deacon