linux.git/kernel/exit.c, branch v6.4-rc2

Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

2023-04-28T22:57:53+00:00

Pull tracing updates from Steven Rostedt:

- User events are finally ready!

After lots of collaboration between various parties, we finally
locked down on a stable interface for user events that can also work
with user space only tracing.

This is implemented by telling the kernel (or user space library, but
that part is user space only and not part of this patch set), where
the variable is that the application uses to know if something is
listening to the trace.

There's also an interface to tell the kernel about these events,
which will show up in the /sys/kernel/tracing/events/user_events/
directory, where it can be enabled.

When it's enabled, the kernel will update the variable, to tell the
application to start writing to the kernel.

See https://lwn.net/Articles/927595/

- Cleaned up the direct trampolines code to simplify arm64 addition of
direct trampolines.

Direct trampolines use the ftrace interface but instead of jumping to
the ftrace trampoline, applications (mostly BPF) can register their
own trampoline for performance reasons.

- Some updates to the fprobe infrastructure. fprobes are more efficient
than kprobes, as it does not need to save all the registers that
kprobes on ftrace do. More work needs to be done before the fprobes
will be exposed as dynamic events.

- More updates to references to the obsolete path of
/sys/kernel/debug/tracing for the new /sys/kernel/tracing path.

- Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer
line by line instead of all at once.

There are users in production kernels that have a large data dump
that originally used printk() directly, but the data dump was larger
than what printk() allowed as a single print.

Using seq_buf() to do the printing fixes that.

- Add /sys/kernel/tracing/touched_functions that shows all functions
that was every traced by ftrace or a direct trampoline. This is used
for debugging issues where a traced function could have caused a
crash by a bpf program or live patching.

- Add a "fields" option that is similar to "raw" but outputs the fields
of the events. It's easier to read by humans.

- Some minor fixes and clean ups.

* tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits)
ring-buffer: Sync IRQ works before buffer destruction
tracing: Add missing spaces in trace_print_hex_seq()
ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
recordmcount: Fix memory leaks in the uwrite function
tracing/user_events: Limit max fault-in attempts
tracing/user_events: Prevent same address and bit per process
tracing/user_events: Ensure bit is cleared on unregister
tracing/user_events: Ensure write index cannot be negative
seq_buf: Add seq_buf_do_printk() helper
tracing: Fix print_fields() for __dyn_loc/__rel_loc
tracing/user_events: Set event filter_type from type
ring-buffer: Clearly check null ptr returned by rb_set_head_page()
tracing: Unbreak user events
tracing/user_events: Use print_format_fields() for trace output
tracing/user_events: Align structs with tabs for readability
tracing/user_events: Limit global user_event count
tracing/user_events: Charge event allocs to cgroups
tracing/user_events: Update documentation for ABI
tracing/user_events: Use write ABI in example
tracing/user_events: Add ABI self-test
...

tracing/user_events: Track fork/exec/exit for mm lifetime

2023-03-29T10:52:08+00:00

During tracefs discussions it was decided instead of requiring a mapping
within a user-process to track the lifetime of memory descriptors we
should hook the appropriate calls. Do this by adding the minimal stubs
required for task fork, exec, and exit. Currently this is just a NOP.
Future patches will implement these calls fully.

Link: https://lkml.kernel.org/r/20230328235219.203-3-beaub@linux.microsoft.com

Suggested-by: Mathieu Desnoyers 
Signed-off-by: Beau Belgrave 
Signed-off-by: Steven Rostedt (Google)

lazy tlb: introduce lazy tlb mm refcount helper functions

2023-03-28T23:20:08+00:00

Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. 
This makes the lazy tlb mm references more obvious, and allows the
refcounting scheme to be modified in later changes.  There is no
functional change with this patch.

Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin 
Acked-by: Linus Torvalds 
Cc: Andy Lutomirski 
Cc: Catalin Marinas 
Cc: Christophe Leroy 
Cc: Dave Hansen 
Cc: Michael Ellerman 
Cc: Nadav Amit 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

2023-02-21T23:27:48+00:00

Pull arm64 updates from Catalin Marinas:

 - Support for arm64 SME 2 and 2.1. SME2 introduces a new 512-bit
   architectural register (ZT0, for the look-up table feature) that
   Linux needs to save/restore

 - Include TPIDR2 in the signal context and add the corresponding
   kselftests

 - Perf updates: Arm SPEv1.2 support, HiSilicon uncore PMU updates, ACPI
   support to the Marvell DDR and TAD PMU drivers, reset DTM_PMU_CONFIG
   (ARM CMN) at probe time

 - Support for DYNAMIC_FTRACE_WITH_CALL_OPS on arm64

 - Permit EFI boot with MMU and caches on. Instead of cleaning the
   entire loaded kernel image to the PoC and disabling the MMU and
   caches before branching to the kernel bare metal entry point, leave
   the MMU and caches enabled and rely on EFI's cacheable 1:1 mapping of
   all of system RAM to populate the initial page tables

 - Expose the AArch32 (compat) ELF_HWCAP features to user in an arm64
   kernel (the arm32 kernel only defines the values)

 - Harden the arm64 shadow call stack pointer handling: stash the shadow
   stack pointer in the task struct on interrupt, load it directly from
   this structure

 - Signal handling cleanups to remove redundant validation of size
   information and avoid reading the same data from userspace twice

 - Refactor the hwcap macros to make use of the automatically generated
   ID registers. It should make new hwcaps writing less error prone

 - Further arm64 sysreg conversion and some fixes

 - arm64 kselftest fixes and improvements

 - Pointer authentication cleanups: don't sign leaf functions, unify
   asm-arch manipulation

 - Pseudo-NMI code generation optimisations

 - Minor fixes for SME and TPIDR2 handling

 - Miscellaneous updates: ARCH_FORCE_MAX_ORDER is now selectable,
   replace strtobool() to kstrtobool() in the cpufeature.c code, apply
   dynamic shadow call stack in two passes, intercept pfn changes in
   set_pte_at() without the required break-before-make sequence, attempt
   to dump all instructions on unhandled kernel faults

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (130 commits)
  arm64: fix .idmap.text assertion for large kernels
  kselftest/arm64: Don't require FA64 for streaming SVE+ZA tests
  kselftest/arm64: Copy whole EXTRA context
  arm64: kprobes: Drop ID map text from kprobes blacklist
  perf: arm_spe: Print the version of SPE detected
  perf: arm_spe: Add support for SPEv1.2 inverted event filtering
  perf: Add perf_event_attr::config3
  arm64/sme: Fix __finalise_el2 SMEver check
  drivers/perf: fsl_imx8_ddr_perf: Remove set-but-not-used variable
  arm64/signal: Only read new data when parsing the ZT context
  arm64/signal: Only read new data when parsing the ZA context
  arm64/signal: Only read new data when parsing the SVE context
  arm64/signal: Avoid rereading context frame sizes
  arm64/signal: Make interface for restore_fpsimd_context() consistent
  arm64/signal: Remove redundant size validation from parse_user_sigframe()
  arm64/signal: Don't redundantly verify FPSIMD magic
  arm64/cpufeature: Use helper macros to specify hwcaps
  arm64/cpufeature: Always use symbolic name for feature value in hwcaps
  arm64/sysreg: Initial unsigned annotations for ID registers
  arm64/sysreg: Initial annotation of signed ID registers
  ...

Compiler attributes: GCC cold function alignment workarounds

2023-01-24T11:49:42+00:00

Contemporary versions of GCC (e.g. GCC 12.2.0) drop the alignment
specified by '-falign-functions=N' for functions marked with the
__cold__ attribute, and potentially for callees of __cold__ functions as
these may be implicitly marked as __cold__ by the compiler. LLVM appears
to respect '-falign-functions=N' in such cases.

This has been reported to GCC in bug 88345:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

... which also covers alignment being dropped when '-Os' is used, which
will be addressed in a separate patch.

Currently, use of '-falign-functions=N' is limited to
CONFIG_FUNCTION_ALIGNMENT, which is largely used for performance and/or
analysis reasons (e.g. with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B), but
isn't necessary for correct functionality. However, this dropped
alignment isn't great for the performance and/or analysis cases.

Subsequent patches will use CONFIG_FUNCTION_ALIGNMENT as part of arm64's
ftrace implementation, which will require all instrumented functions to
be aligned to at least 8-bytes.

This patch works around the dropped alignment by avoiding the use of the
__cold__ attribute when CONFIG_FUNCTION_ALIGNMENT is non-zero, and by
specifically aligning abort(), which GCC implicitly marks as __cold__.
As the __cold macro is now dependent upon config options (which is
against the policy described at the top of compiler_attributes.h), it is
moved into compiler_types.h.

I've tested this by building and booting a kernel configured with
defconfig + CONFIG_EXPERT=y + CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y,
and looking for misaligned text symbols in /proc/kallsyms:

* arm64:

  Before:
    # uname -rm
    6.2.0-rc3 aarch64
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    5009

  After:
    # uname -rm
    6.2.0-rc3-00001-g2a2bedf8bfa9 aarch64
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    919

* x86_64:

  Before:
    # uname -rm
    6.2.0-rc3 x86_64
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    11537

  After:
    # uname -rm
    6.2.0-rc3-00001-g2a2bedf8bfa9 x86_64
    # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l
    2805

There's clearly a substantial reduction in the number of misaligned
symbols. From manual inspection, the remaining unaligned text labels are
a combination of ACPICA functions (due to the use of '-Os'), static call
trampolines, and non-function labels in assembly, which will be dealt
with in subsequent patches.

Signed-off-by: Mark Rutland 
Cc: Florent Revest 
Cc: Masami Hiramatsu 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Will Deacon 
Cc: Miguel Ojeda 
Cc: Nick Desaulniers 
Link: https://lore.kernel.org/r/20230123134603.1064407-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas

exit: Detect and fix irq disabled state in oops

2023-01-20T23:06:10+00:00

If a task oopses with irqs disabled, this can cause various cascading
problems in the oops path such as sleep-from-invalid warnings, and
potentially worse.

Since commit 0258b5fd7c712 ("coredump: Limit coredumps to a single
thread group"), the unconditional irq enable in coredump_task_exit()
will "fix" the irq state to be enabled early in do_exit(), so currently
this may not be triggerable, but that is coincidental and fragile.

Detect and fix the irqs_disabled() condition in the oops path before
calling do_exit(), similarly to the way in_atomic() is handled.

Reported-by: Michael Ellerman 
Signed-off-by: Nicholas Piggin 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: "Eric W. Biederman" 
Link: https://lore.kernel.org/lkml/20221004094401.708299-1-npiggin@gmail.com/

exit: Use READ_ONCE() for all oops/warn limit reads

2022-12-16T20:26:57+00:00

Use a temporary variable to take full advantage of READ_ONCE() behavior.
Without this, the report (and even the test) might be out of sync with
the initial test.

Reported-by: Peter Zijlstra 
Link: https://lore.kernel.org/lkml/Y5x7GXeluFmZ8E0E@hirez.programming.kicks-ass.net
Fixes: 9fc9e278a5c0 ("panic: Introduce warn_limit")
Fixes: d4ccd54d28d3 ("exit: Put an upper limit on how often we can oops")
Cc: "Eric W. Biederman" 
Cc: Jann Horn 
Cc: Arnd Bergmann 
Cc: Petr Mladek 
Cc: Andrew Morton 
Cc: Luis Chamberlain 
Cc: Marco Elver 
Cc: tangmeng 
Cc: Sebastian Andrzej Siewior 
Cc: Tiezhu Yang 
Signed-off-by: Kees Cook

exit: Allow oops_limit to be disabled

2022-12-02T21:04:39+00:00

In preparation for keeping oops_limit logic in sync with warn_limit,
have oops_limit == 0 disable checking the Oops counter.

Cc: Jann Horn 
Cc: Jonathan Corbet 
Cc: Andrew Morton 
Cc: Baolin Wang 
Cc: "Jason A. Donenfeld" 
Cc: Eric Biggers 
Cc: Huang Ying 
Cc: "Eric W. Biederman" 
Cc: Arnd Bergmann 
Cc: linux-doc@vger.kernel.org
Signed-off-by: Kees Cook

exit: Expose "oops_count" to sysfs

2022-12-01T16:50:38+00:00

Since Oops count is now tracked and is a fairly interesting signal, add
the entry /sys/kernel/oops_count to expose it to userspace.

Cc: "Eric W. Biederman" 
Cc: Jann Horn 
Cc: Arnd Bergmann 
Reviewed-by: Luis Chamberlain 
Signed-off-by: Kees Cook 
Link: https://lore.kernel.org/r/20221117234328.594699-3-keescook@chromium.org

exit: Put an upper limit on how often we can oops

2022-12-01T16:50:38+00:00

Many Linux systems are configured to not panic on oops; but allowing an
attacker to oops the system **really** often can make even bugs that look
completely unexploitable exploitable (like NULL dereferences and such) if
each crash elevates a refcount by one or a lock is taken in read mode, and
this causes a counter to eventually overflow.

The most interesting counters for this are 32 bits wide (like open-coded
refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
platforms is just 16 bits, but probably nobody cares about 32-bit platforms
that much nowadays.)

So let's panic the system if the kernel is constantly oopsing.

The speed of oopsing 2^32 times probably depends on several factors, like
how long the stack trace is and which unwinder you're using; an empirically
important one is whether your console is showing a graphical environment or
a text console that oopses will be printed to.
In a quick single-threaded benchmark, it looks like oopsing in a vfork()
child with a very short stack trace only takes ~510 microseconds per run
when a graphical console is active; but switching to a text console that
oopses are printed to slows it down around 87x, to ~45 milliseconds per
run.
(Adding more threads makes this faster, but the actual oops printing
happens under &die_lock on x86, so you can maybe speed this up by a factor
of around 2 and then any further improvement gets eaten up by lock
contention.)

It looks like it would take around 8-12 days to overflow a 32-bit counter
with repeated oopsing on a multi-core X86 system running a graphical
environment; both me (in an X86 VM) and Seth (with a distro kernel on
normal hardware in a standard configuration) got numbers in that ballpark.

12 days aren't *that* short on a desktop system, and you'd likely need much
longer on a typical server system (assuming that people don't run graphical
desktop environments on their servers), and this is a *very* noisy and
violent approach to exploiting the kernel; and it also seems to take orders
of magnitude longer on some machines, probably because stuff like EFI
pstore will slow it down a ton if that's active.

Signed-off-by: Jann Horn 
Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com
Reviewed-by: Luis Chamberlain 
Signed-off-by: Kees Cook 
Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org