linux.git/arch/x86/kernel/process_64.c, branch v6.14

x86/mm: Cleanup prctl_enable_tagged_addr() nr_bits error checking

2024-07-02T18:33:44+00:00

There are two separate checks in prctl_enable_tagged_addr() that nr_bits
is in the correct range. The checks are arranged such the correct case
is sandwiched between both error cases, which do exactly the same thing.

Simplify the if condition and pull the correct case outside with the
rest of the success code path.

Signed-off-by: Yosry Ahmed 
Signed-off-by: Dave Hansen 
Reviewed-by: Kirill A. Shutemov 
Link: https://lore.kernel.org/all/20240702132139.3332013-4-yosryahmed%40google.com

x86/mm: Fix LAM inconsistency during context switch

2024-07-02T18:32:16+00:00

LAM can only be enabled when a process is single-threaded.  But _kernel_
threads can temporarily use a single-threaded process's mm.  That means
that a context-switching kernel thread can race and observe the mm's LAM
metadata (mm->context.lam_cr3_mask) change.

The context switch code does two logical things with that metadata:
populate CR3 and populate 'cpu_tlbstate.lam'.  If it hits this race,
'cpu_tlbstate.lam' and CR3 can end up out of sync.

This de-synchronization is currently harmless.  But it is confusing and
might lead to warnings or real bugs.

Update set_tlbstate_lam_mode() to take in the LAM mask and untag mask
instead of an mm_struct pointer, and while we are at it, rename it to
cpu_tlbstate_update_lam(). This should also make it clearer that we are
updating cpu_tlbstate. In switch_mm_irqs_off(), read the LAM mask once
and use it for both the cpu_tlbstate update and the CR3 update.

Signed-off-by: Yosry Ahmed 
Signed-off-by: Dave Hansen 
Reviewed-by: Kirill A. Shutemov 
Link: https://lore.kernel.org/all/20240702132139.3332013-3-yosryahmed%40google.com

x86/mm: Use IPIs to synchronize LAM enablement

2024-07-02T18:31:51+00:00

LAM can only be enabled when a process is single-threaded.  But _kernel_
threads can temporarily use a single-threaded process's mm.

If LAM is enabled by a userspace process while a kthread is using its
mm, the kthread will not observe LAM enablement (i.e.  LAM will be
disabled in CR3). This could be fine for the kthread itself, as LAM only
affects userspace addresses. However, if the kthread context switches to
a thread in the same userspace process, CR3 may or may not be updated
because the mm_struct doesn't change (based on pending TLB flushes). If
CR3 is not updated, the userspace thread will run incorrectly with LAM
disabled, which may cause page faults when using tagged addresses.
Example scenario:

CPU 1                                   CPU 2
/* kthread */
kthread_use_mm()
                                        /* user thread */
                                        prctl_enable_tagged_addr()
                                        /* LAM enabled on CPU 2 */
/* LAM disabled on CPU 1 */
                                        context_switch() /* to CPU 1 */
/* Switching to user thread */
switch_mm_irqs_off()
/* CR3 not updated */
/* LAM is still disabled on CPU 1 */

Synchronize LAM enablement by sending an IPI to all CPUs running with
the mm_struct to enable LAM. This makes sure LAM is enabled on CPU 1
in the above scenario before prctl_enable_tagged_addr() returns and
userspace starts using tagged addresses, and before it's possible to
run the userspace process on CPU 1.

In switch_mm_irqs_off(), move reading the LAM mask until after
mm_cpumask() is updated. This ensures that if an outdated LAM mask is
written to CR3, an IPI is received to update it right after IRQs are
re-enabled.

[ dhansen: Add a LAM enabling helper and comment it ]

Fixes: 82721d8b25d7 ("x86/mm: Handle LAM on context switch")
Suggested-by: Andy Lutomirski 
Signed-off-by: Yosry Ahmed 
Signed-off-by: Dave Hansen 
Reviewed-by: Kirill A. Shutemov 
Link: https://lore.kernel.org/all/20240702132139.3332013-2-yosryahmed%40google.com

x86/cpu: Fix check for RDPKRU in __show_regs()

2024-04-24T12:30:21+00:00

cpu_feature_enabled(X86_FEATURE_OSPKE) does not necessarily reflect
whether CR4.PKE is set on the CPU.  In particular, they may differ on
non-BSP CPUs before setup_pku() is executed.  In this scenario, RDPKRU
will #UD causing the system to hang.

Fix by checking CR4 for PKE enablement which is always correct for the
current CPU.

The scenario happens by inserting a WARN* before setup_pku() in
identiy_cpu() or some other diagnostic which would lead to calling
__show_regs().

  [ bp: Massage commit message. ]

Signed-off-by: David Kaplan 
Signed-off-by: Borislav Petkov (AMD) 
Link: https://lore.kernel.org/r/20240421191728.32239-1-bp@kernel.org

Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-03-12T02:53:15+00:00

Pull core x86 updates from Ingo Molnar:

 - The biggest change is the rework of the percpu code, to support the
   'Named Address Spaces' GCC feature, by Uros Bizjak:

      - This allows C code to access GS and FS segment relative memory
        via variables declared with such attributes, which allows the
        compiler to better optimize those accesses than the previous
        inline assembly code.

      - The series also includes a number of micro-optimizations for
        various percpu access methods, plus a number of cleanups of %gs
        accesses in assembly code.

      - These changes have been exposed to linux-next testing for the
        last ~5 months, with no known regressions in this area.

 - Fix/clean up __switch_to()'s broken but accidentally working handling
   of FPU switching - which also generates better code

 - Propagate more RIP-relative addressing in assembly code, to generate
   slightly better code

 - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
   make it easier for distros to follow & maintain these options

 - Rework the x86 idle code to cure RCU violations and to clean up the
   logic

 - Clean up the vDSO Makefile logic

 - Misc cleanups and fixes

* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/idle: Select idle routine only once
  x86/idle: Let prefer_mwait_c1_over_halt() return bool
  x86/idle: Cleanup idle_setup()
  x86/idle: Clean up idle selection
  x86/idle: Sanitize X86_BUG_AMD_E400 handling
  sched/idle: Conditionally handle tick broadcast in default_idle_call()
  x86: Increase brk randomness entropy for 64-bit systems
  x86/vdso: Move vDSO to mmap region
  x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
  x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
  x86/retpoline: Ensure default return thunk isn't used at runtime
  x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
  x86/vdso: Use $(addprefix ) instead of $(foreach )
  x86/vdso: Simplify obj-y addition
  x86/vdso: Consolidate targets and clean-files
  x86/bugs: Rename CONFIG_RETHUNK              => CONFIG_MITIGATION_RETHUNK
  x86/bugs: Rename CONFIG_CPU_SRSO             => CONFIG_MITIGATION_SRSO
  x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY       => CONFIG_MITIGATION_IBRS_ENTRY
  x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY      => CONFIG_MITIGATION_UNRET_ENTRY
  x86/bugs: Rename CONFIG_SLS                  => CONFIG_MITIGATION_SLS
  ...

x86/fred: Allow single-step trap and NMI when starting a new task

2024-01-31T21:02:00+00:00

Entering a new task is logically speaking a return from a system call
(exec, fork, clone, etc.). As such, if ptrace enables single stepping
a single step exception should be allowed to trigger immediately upon
entering user space. This is not optional.

NMI should *never* be disabled in user space. As such, this is an
optional, opportunistic way to catch errors.

Allow single-step trap and NMI when starting a new task, thus once
the new task enters user space, single-step trap and NMI are both
enabled immediately.

Signed-off-by: H. Peter Anvin (Intel) 
Signed-off-by: Xin Li 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov (AMD) 
Tested-by: Shan Kang 
Link: https://lore.kernel.org/r/20231205105030.8698-21-xin3.li@intel.com

x86/fred: Disallow the swapgs instruction when FRED is enabled

2024-01-31T21:01:41+00:00

SWAPGS is no longer needed thus NOT allowed with FRED because FRED
transitions ensure that an operating system can _always_ operate
with its own GS base address:

  - For events that occur in ring 3, FRED event delivery swaps the GS
    base address with the IA32_KERNEL_GS_BASE MSR.

  - ERETU (the FRED transition that returns to ring 3) also swaps the
    GS base address with the IA32_KERNEL_GS_BASE MSR.

And the operating system can still setup the GS segment for a user
thread without the need of loading a user thread GS with:

  - Using LKGS, available with FRED, to modify other attributes of the
    GS segment without compromising its ability always to operate with
    its own GS base address.

  - Accessing the GS segment base address for a user thread as before
    using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR.

Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE MSR
instead of the GS segment's descriptor cache. As such, the operating
system never changes its runtime GS base address.

Signed-off-by: H. Peter Anvin (Intel) 
Signed-off-by: Xin Li 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov (AMD) 
Tested-by: Shan Kang 
Link: https://lore.kernel.org/r/20231205105030.8698-19-xin3.li@intel.com

x86/ptrace: Cleanup the definition of the pt_regs structure

2024-01-31T21:01:13+00:00

struct pt_regs is hard to read because the member or section related
comments are not aligned with the members.

The 'cs' and 'ss' members of pt_regs are type of 'unsigned long' while
in reality they are only 16-bit wide. This works so far as the
remaining space is unused, but FRED will use the remaining bits for
other purposes.

To prepare for FRED:

  - Cleanup the formatting
  - Convert 'cs' and 'ss' to u16 and embed them into an union
    with a u64
  - Fixup the related printk() format strings

Suggested-by: Thomas Gleixner 
Originally-by: H. Peter Anvin (Intel) 
Signed-off-by: Xin Li 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov (AMD) 
Tested-by: Shan Kang 
Link: https://lore.kernel.org/r/20231205105030.8698-14-xin3.li@intel.com

x86/fpu: Clean up FPU switching in the middle of task switching

2023-10-20T09:24:22+00:00

It happens to work, but it's very very wrong, because our 'current'
macro is magic that is supposedly loading a stable value.

It just happens to be not quite stable enough and the compilers
re-load the value enough for this code to work.  But it's wrong.

The whole

        struct fpu *prev_fpu = &prev->fpu;

thing in __switch_to() is pretty ugly. There's no reason why we
should look at that 'prev_fpu' pointer there, or pass it down.

And it only generates worse code, in how it loads 'current' when
__switch_to() has the right task pointers.

The attached patch not only cleans this up, it actually
generates better code too:

 (a) it removes one push/pop pair at entry/exit because there's one
     less register used (no 'current')

 (b) it removes that pointless load of 'current' because it just uses
     the right argument:

	-       movq    %gs:pcpu_hot(%rip), %r12
	-       testq   $16384, (%r12)
	+       testq   $16384, (%rdi)

Signed-off-by: Linus Torvalds 
Signed-off-by: Uros Bizjak 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20231018184227.446318-1-ubizjak@gmail.com

x86/shstk: Add ARCH_SHSTK_STATUS

2023-08-02T22:01:51+00:00

CRIU and GDB need to get the current shadow stack and WRSS enablement
status. This information is already available via /proc/pid/status, but
this is inconvenient for CRIU because it involves parsing the text output
in an area of the code where this is difficult. Provide a status
arch_prctl(), ARCH_SHSTK_STATUS for retrieving the status. Have arg2 be a
userspace address, and make the new arch_prctl simply copy the features
out to userspace.

Suggested-by: Mike Rapoport 
Signed-off-by: Rick Edgecombe 
Signed-off-by: Dave Hansen 
Reviewed-by: Borislav Petkov (AMD) 
Reviewed-by: Kees Cook 
Acked-by: Mike Rapoport (IBM) 
Tested-by: Pengfei Xu 
Tested-by: John Allen 
Tested-by: Kees Cook 
Link: https://lore.kernel.org/all/20230613001108.3040476-43-rick.p.edgecombe%40intel.com