linux-stable.git/arch/x86/kernel/process_32.c, branch linux-5.4.y

x86/resctl: fix scheduler confusion with 'current'

2023-03-11T15:44:16+00:00

commit 7fef099702527c3b2c5234a2ea6a24411485a13a upstream.

The implementation of 'current' on x86 is very intentionally special: it
is a very common thing to look up, and it uses 'this_cpu_read_stable()'
to get the current thread pointer efficiently from per-cpu storage.

And the keyword in there is 'stable': the current thread pointer never
changes as far as a single thread is concerned.  Even if when a thread
is preempted, or moved to another CPU, or even across an explicit call
'schedule()' that thread will still have the same value for 'current'.

It is, after all, the kernel base pointer to thread-local storage.
That's why it's stable to begin with, but it's also why it's important
enough that we have that special 'this_cpu_read_stable()' access for it.

So this is all done very intentionally to allow the compiler to treat
'current' as a value that never visibly changes, so that the compiler
can do CSE and combine multiple different 'current' accesses into one.

However, there is obviously one very special situation when the
currently running thread does actually change: inside the scheduler
itself.

So the scheduler code paths are special, and do not have a 'current'
thread at all.  Instead there are _two_ threads: the previous and the
next thread - typically called 'prev' and 'next' (or prev_p/next_p)
internally.

So this is all actually quite straightforward and simple, and not all
that complicated.

Except for when you then have special code that is run in scheduler
context, that code then has to be aware that 'current' isn't really a
valid thing.  Did you mean 'prev'? Did you mean 'next'?

In fact, even if then look at the code, and you use 'current' after the
new value has been assigned to the percpu variable, we have explicitly
told the compiler that 'current' is magical and always stable.  So the
compiler is quite free to use an older (or newer) value of 'current',
and the actual assignment to the percpu storage is not relevant even if
it might look that way.

Which is exactly what happened in the resctl code, that blithely used
'current' in '__resctrl_sched_in()' when it really wanted the new
process state (as implied by the name: we're scheduling 'into' that new
resctl state).  And clang would end up just using the old thread pointer
value at least in some configurations.

This could have happened with gcc too, and purely depends on random
compiler details.  Clang just seems to have been more aggressive about
moving the read of the per-cpu current_task pointer around.

The fix is trivial: just make the resctl code adhere to the scheduler
rules of using the prev/next thread pointer explicitly, instead of using
'current' in a situation where it just wasn't valid.

That same code is then also used outside of the scheduler context (when
a thread resctl state is explicitly changed), and then we will just pass
in 'current' as that pointer, of course.  There is no ambiguity in that
case.

The fix may be trivial, but noticing and figuring out what went wrong
was not.  The credit for that goes to Stephane Eranian.

Reported-by: Stephane Eranian 
Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/
Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/
Reviewed-by: Nick Desaulniers 
Tested-by: Tony Luck 
Tested-by: Stephane Eranian 
Tested-by: Babu Moger 
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

x86/fpu: Correct pkru/xstate inconsistency

2022-03-02T10:41:03+00:00

When eagerly switching PKRU in switch_fpu_finish() it checks that
current is not a kernel thread as kernel threads will never use PKRU.
It's possible that this_cpu_read_stable() on current_task
(ie. get_current()) is returning an old cached value. To resolve this
reference next_p directly rather than relying on current.

As written it's possible when switching from a kernel thread to a
userspace thread to observe a cached PF_KTHREAD flag and never restore
the PKRU. And as a result this issue only occurs when switching
from a kernel thread to a userspace thread, switching from a non kernel
thread works perfectly fine because all that is considered in that
situation are the flags from some other non kernel task and the next fpu
is passed in to switch_fpu_finish().

This behavior only exists between 5.2 and 5.13 when it was fixed by a
rewrite decoupling PKRU from xstate, in:
  commit 954436989cc5 ("x86/fpu: Remove PKRU handling from switch_fpu_finish()")

Unfortunately backporting the fix from 5.13 is probably not realistic as
it's part of a 60+ patch series which rewrites most of the PKRU handling.

Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Brian Geffon 
Signed-off-by: Willis Kung 
Tested-by: Willis Kung 
Cc:  # v5.4.x
Cc:  # v5.10.x
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman

x86/stackframe/32: Provide consistent pt_regs

2019-06-25T08:23:47+00:00

Currently pt_regs on x86_32 has an oddity in that kernel regs
(!user_mode(regs)) are short two entries (esp/ss). This means that any
code trying to use them (typically: regs->sp) needs to jump through
some unfortunate hoops.

Change the entry code to fix this up and create a full pt_regs frame.

This then simplifies various trampolines in ftrace and kprobes, the
stack unwinder, ptrace, kdump and kgdb.

Much thanks to Josh for help with the cleanups!

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Josh Poimboeuf 
Acked-by: Masami Hiramatsu 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Signed-off-by: Ingo Molnar

Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2019-05-07T17:24:10+00:00

Pull x86 FPU state handling updates from Borislav Petkov:
 "This contains work started by Rik van Riel and brought to fruition by
  Sebastian Andrzej Siewior with the main goal to optimize when to load
  FPU registers: only when returning to userspace and not on every
  context switch (while the task remains in the kernel).

  In addition, this optimization makes kernel_fpu_begin() cheaper by
  requiring registers saving only on the first invocation and skipping
  that in following ones.

  What is more, this series cleans up and streamlines many aspects of
  the already complex FPU code, hopefully making it more palatable for
  future improvements and simplifications.

  Finally, there's a __user annotations fix from Jann Horn"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails
  x86/pkeys: Add PKRU value to init_fpstate
  x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath
  x86/fpu: Add a fastpath to copy_fpstate_to_sigframe()
  x86/fpu: Add a fastpath to __fpu__restore_sig()
  x86/fpu: Defer FPU state load until return to userspace
  x86/fpu: Merge the two code paths in __fpu__restore_sig()
  x86/fpu: Restore from kernel memory on the 64-bit path too
  x86/fpu: Inline copy_user_to_fpregs_zeroing()
  x86/fpu: Update xstate's PKRU value on write_pkru()
  x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD
  x86/fpu: Always store the registers in copy_fpstate_to_sigframe()
  x86/entry: Add TIF_NEED_FPU_LOAD
  x86/fpu: Eager switch PKRU state
  x86/pkeys: Don't check if PKRU is zero before writing it
  x86/fpu: Only write PKRU if it is different from current
  x86/pkeys: Provide *pkru() helpers
  x86/fpu: Use a feature number instead of mask in two more helpers
  x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask
  x86/fpu: Add an __fpregs_load_activate() internal helper
  ...

x86/fpu: Defer FPU state load until return to userspace

2019-04-12T17:34:47+00:00

Defer loading of FPU state until return to userspace. This gives
the kernel the potential to skip loading FPU state for tasks that
stay in kernel mode, or for tasks that end up with repeated
invocations of kernel_fpu_begin() & kernel_fpu_end().

The fpregs_lock/unlock() section ensures that the registers remain
unchanged. Otherwise a context switch or a bottom half could save the
registers to its FPU context and the processor's FPU registers would
became random if modified at the same time.

KVM swaps the host/guest registers on entry/exit path. This flow has
been kept as is. First it ensures that the registers are loaded and then
saves the current (host) state before it loads the guest's registers. The
swap is done at the very end with disabled interrupts so it should not
change anymore before theg guest is entered. The read/save version seems
to be cheaper compared to memcpy() in a micro benchmark.

Each thread gets TIF_NEED_FPU_LOAD set as part of fork() / fpu__copy().
For kernel threads, this flag gets never cleared which avoids saving /
restoring the FPU state for kernel threads and during in-kernel usage of
the FPU registers.

 [
   bp: Correct and update commit message and fix checkpatch warnings.
   s/register/registers/ where it is used in plural.
   minor comment corrections.
   remove unused trace_x86_fpu_activate_state() TP.
 ]

Signed-off-by: Rik van Riel 
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Borislav Petkov 
Reviewed-by: Dave Hansen 
Reviewed-by: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Aubrey Li 
Cc: Babu Moger 
Cc: "Chang S. Bae" 
Cc: Dmitry Safonov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: "Jason A. Donenfeld" 
Cc: Joerg Roedel 
Cc: Konrad Rzeszutek Wilk 
Cc: kvm ML 
Cc: Nicolai Stange 
Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Tim Chen 
Cc: Waiman Long 
Cc: x86-ml 
Cc: Yi Wang 
Link: https://lkml.kernel.org/r/20190403164156.19645-24-bigeasy@linutronix.de

x86/fpu: Remove fpu->initialized

2019-04-10T13:42:40+00:00

The struct fpu.initialized member is always set to one for user tasks
and zero for kernel tasks. This avoids saving/restoring the FPU
registers for kernel threads.

The ->initialized = 0 case for user tasks has been removed in previous
changes, for instance, by doing an explicit unconditional init at fork()
time for FPU-less systems which was otherwise delayed until the emulated
opcode.

The context switch code (switch_fpu_prepare() + switch_fpu_finish())
can't unconditionally save/restore registers for kernel threads. Not
only would it slow down the switch but also load a zeroed xcomp_bv for
XSAVES.

For kernel_fpu_begin() (+end) the situation is similar: EFI with runtime
services uses this before alternatives_patched is true. Which means that
this function is used too early and it wasn't the case before.

For those two cases, use current->mm to distinguish between user and
kernel thread. For kernel_fpu_begin() skip save/restore of the FPU
registers.

During the context switch into a kernel thread don't do anything. There
is no reason to save the FPU state of a kernel thread.

The reordering in __switch_to() is important because the current()
pointer needs to be valid before switch_fpu_finish() is invoked so ->mm
is seen of the new task instead the old one.

N.B.: fpu__save() doesn't need to check ->mm because it is called by
user tasks only.

 [ bp: Massage. ]

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Borislav Petkov 
Reviewed-by: Dave Hansen 
Reviewed-by: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Aubrey Li 
Cc: Babu Moger 
Cc: "Chang S. Bae" 
Cc: Dmitry Safonov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: "Jason A. Donenfeld" 
Cc: Joerg Roedel 
Cc: kvm ML 
Cc: Masami Hiramatsu 
Cc: Mathieu Desnoyers 
Cc: Nicolai Stange 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Radim Krčmář 
Cc: Rik van Riel 
Cc: Sergey Senozhatsky 
Cc: Will Deacon 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20190403164156.19645-8-bigeasy@linutronix.de

x86/fpu: Remove fpu__restore()

2019-04-09T17:27:42+00:00

There are no users of fpu__restore() so it is time to remove it. The
comment regarding fpu__restore() and TS bit is stale since commit

  b3b0870ef3ffe ("i387: do not preload FPU state at task switch time")

and has no meaning since.

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Borislav Petkov 
Reviewed-by: Dave Hansen 
Reviewed-by: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Aubrey Li 
Cc: Babu Moger 
Cc: "Chang S. Bae" 
Cc: Dmitry Safonov 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: "Jason A. Donenfeld" 
Cc: Joerg Roedel 
Cc: Jonathan Corbet 
Cc: kvm ML 
Cc: linux-doc@vger.kernel.org
Cc: Nicolai Stange 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Rik van Riel 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20190403164156.19645-3-bigeasy@linutronix.de

sched/x86: Save [ER]FLAGS on context switch

2019-04-03T07:36:27+00:00

Effectively reverts commit:

  2c7577a75837 ("sched/x86_64: Don't save flags on context switch")

Specifically because SMAP uses FLAGS.AC which invalidates the claim
that the kernel has clean flags.

In particular; while preemption from interrupt return is fine (the
IRET frame on the exception stack contains FLAGS) it breaks any code
that does synchonous scheduling, including preempt_enable().

This has become a significant issue ever since commit:

  5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses")

provided for means of having 'normal' C code between STAC / CLAC,
exposing the FLAGS.AC state. So far this hasn't led to trouble,
however fix it before it comes apart.

Reported-by: Julien Thierry 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: stable@kernel.org
Fixes: 5b24a7a2aa20 ("Add 'unsafe' user access functions for batched accesses")
Signed-off-by: Ingo Molnar

Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2018-12-27T01:37:51+00:00

Pull x86 fpu updates from Ingo Molnar:
 "Misc preparatory changes for an upcoming FPU optimization that will
  delay the loading of FPU registers to return-to-userspace"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Don't export __kernel_fpu_{begin,end}()
  x86/fpu: Update comment for __raw_xsave_addr()
  x86/fpu: Add might_fault() to user_insn()
  x86/pkeys: Make init_pkru_value static
  x86/thread_info: Remove _TIF_ALLWORK_MASK
  x86/process/32: Remove asm/math_emu.h include
  x86/fpu: Use unsigned long long shift in xfeature_uncompacted_offset()

Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2018-12-26T20:17:43+00:00

Pull x86 cache control updates from Borislav Petkov:

 - The generalization of the RDT code to accommodate the addition of
   AMD's very similar implementation of the cache monitoring feature.

   This entails a subsystem move into a separate and generic
   arch/x86/kernel/cpu/resctrl/ directory along with adding
   vendor-specific initialization and feature detection helpers.

   Ontop of that is the unification of user-visible strings, both in the
   resctrl filesystem error handling and Kconfig.

   Provided by Babu Moger and Sherry Hurwitz.

 - Code simplifications and error handling improvements by Reinette
   Chatre.

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Fix rdt_find_domain() return value and checks
  x86/resctrl: Remove unnecessary check for cbm_validate()
  x86/resctrl: Use rdt_last_cmd_puts() where possible
  MAINTAINERS: Update resctrl filename patterns
  Documentation: Rename and update intel_rdt_ui.txt to resctrl_ui.txt
  x86/resctrl: Introduce AMD QOS feature
  x86/resctrl: Fixup the user-visible strings
  x86/resctrl: Add AMD's X86_FEATURE_MBA to the scattered CPUID features
  x86/resctrl: Rename the config option INTEL_RDT to RESCTRL
  x86/resctrl: Add vendor check for the MBA software controller
  x86/resctrl: Bring cbm_validate() into the resource structure
  x86/resctrl: Initialize the vendor-specific resource functions
  x86/resctrl: Move all the macros to resctrl/internal.h
  x86/resctrl: Re-arrange the RDT init code
  x86/resctrl: Rename the RDT functions and definitions
  x86/resctrl: Rename and move rdt files to a separate directory