linux.git/arch/x86/kernel/process_32.c, branch v6.8

x86/resctl: fix scheduler confusion with 'current'

2023-03-08T19:48:11+00:00

The implementation of 'current' on x86 is very intentionally special: it
is a very common thing to look up, and it uses 'this_cpu_read_stable()'
to get the current thread pointer efficiently from per-cpu storage.

And the keyword in there is 'stable': the current thread pointer never
changes as far as a single thread is concerned.  Even when a thread is
preempted, or moved to another CPU, or even across an explicit call to
'schedule()', that thread will still have the same value for 'current'.

It is, after all, the kernel base pointer to thread-local storage.
That's why it's stable to begin with, but it's also why it's important
enough that we have that special 'this_cpu_read_stable()' access for it.

So this is all done very intentionally to allow the compiler to treat
'current' as a value that never visibly changes, so that the compiler
can do CSE and combine multiple different 'current' accesses into one.

However, there is obviously one very special situation when the
currently running thread does actually change: inside the scheduler
itself.

So the scheduler code paths are special, and do not have a 'current'
thread at all.  Instead there are _two_ threads: the previous and the
next thread - typically called 'prev' and 'next' (or prev_p/next_p)
internally.

So this is all actually quite straightforward and simple, and not all
that complicated.

Except for when you then have special code that is run in scheduler
context, that code then has to be aware that 'current' isn't really a
valid thing.  Did you mean 'prev'? Did you mean 'next'?

In fact, even if you then look at the code, and you use 'current' after
the new value has been assigned to the percpu variable, we have
explicitly told the compiler that 'current' is magical and always
stable.  So the compiler is quite free to use an older (or newer) value
of 'current', and the actual assignment to the percpu storage is not
relevant even if it might look that way.
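
The hazard can be modelled in plain C by writing the hoisted read out by
hand.  This is only a sketch: 'struct task_sketch', 'current_slot' and
both functions are hypothetical names, and the 'cached' local stands in
for what a compiler is permitted to do with a read it was told is stable.

```c
/* Hypothetical userspace model; none of these names exist in the kernel. */
struct task_sketch {
	int id;
};

/* Models the per-cpu slot that backs 'current'. */
static struct task_sketch *current_slot;

/*
 * Source order may suggest that 'current' is read after the store, but a
 * read declared stable may be hoisted above it.  The hoist is written out
 * by hand here so that the outcome is deterministic.
 */
static int id_if_read_is_hoisted(struct task_sketch *next)
{
	struct task_sketch *cached = current_slot;	/* read moved early */

	current_slot = next;	/* the scheduler installs the new task */
	return cached->id;	/* still the previous task's id */
}

/* Using the explicit 'next' pointer does not depend on read placement. */
static int id_with_explicit_next(struct task_sketch *next)
{
	current_slot = next;
	return next->id;
}
```

With 'current_slot' pointing at the previous task, the first function
reports the previous task's id and the second reports the next task's.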

Which is exactly what happened in the resctrl code, that blithely used
'current' in '__resctrl_sched_in()' when it really wanted the new
process state (as implied by the name: we're scheduling 'into' that new
resctrl state).  And clang would end up just using the old thread
pointer value at least in some configurations.

This could have happened with gcc too, and purely depends on random
compiler details.  Clang just seems to have been more aggressive about
moving the read of the per-cpu current_task pointer around.

The fix is trivial: just make the resctrl code adhere to the scheduler
rules of using the prev/next thread pointer explicitly, instead of using
'current' in a situation where it just wasn't valid.

That same code is then also used outside of the scheduler context (when
a thread's resctrl state is explicitly changed), and then we will just
pass in 'current' as that pointer, of course.  There is no ambiguity in
that case.
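
The shape of that interface, with both kinds of caller, might look
roughly like the sketch below.  It is a userspace model under stated
assumptions: every name ending in '_sketch', plus 'current_slot' and
'programmed_closid', is hypothetical, and a plain variable stands in for
whatever state the real code programs.

```c
/* Hypothetical stand-ins; not the kernel's definitions. */
struct task_sketch {
	unsigned int closid;	/* resource-control class of the task */
};

static struct task_sketch *current_slot;	/* models per-cpu current */
static unsigned int programmed_closid;		/* models programmed state */

/* Takes the task explicitly, so it is valid in scheduler context too. */
static void resctrl_sched_in_sketch(struct task_sketch *tsk)
{
	programmed_closid = tsk->closid;
}

/* Scheduler path: there is no 'current', only prev and next. */
static void switch_to_sketch(struct task_sketch *prev,
			     struct task_sketch *next)
{
	(void)prev;		/* unused in this reduced model */
	current_slot = next;
	resctrl_sched_in_sketch(next);
}

/* Non-scheduler path: the running task updates its own state. */
static void set_own_closid_sketch(unsigned int closid)
{
	current_slot->closid = closid;
	resctrl_sched_in_sketch(current_slot);
}
```

The callee never has to guess which task is meant; each caller states it.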

The fix may be trivial, but noticing and figuring out what went wrong
was not.  The credit for that goes to Stephane Eranian.

Reported-by: Stephane Eranian 
Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/
Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/
Reviewed-by: Nick Desaulniers 
Tested-by: Tony Luck 
Tested-by: Stephane Eranian 
Tested-by: Babu Moger 
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds

x86/percpu: Move current_top_of_stack next to current_task

2022-10-17T14:41:05+00:00

Extend the struct pcpu_hot cacheline with current_top_of_stack;
another very frequently used value.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org

x86: Put hot per CPU variables into a struct

2022-10-17T14:41:03+00:00

The layout of per-cpu variables is at the mercy of the compiler. This
can lead to random performance fluctuations from build to build.

Create a structure to hold some of the hottest per-cpu variables,
starting with current_task.
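
One way to pin such a layout down is sketched below.  It is not the
definition from this patch: the type and macro names are hypothetical,
and the 64-byte cache line size is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES_SKETCH 64	/* assumed cache line size */

struct task_sketch;			/* opaque stand-in for a task */

/*
 * Grouping the hot fields in one struct fixes their relative layout in
 * the source, and the padding union fixes the total size.
 */
struct hot_percpu_sketch {
	union {
		struct {
			struct task_sketch *current_task;
			/* further hot fields would be added here */
		};
		unsigned char pad[CACHELINE_BYTES_SKETCH];
	};
};

/* Fail the build if the hot data outgrows a single cache line. */
_Static_assert(sizeof(struct hot_percpu_sketch) == CACHELINE_BYTES_SKETCH,
	       "hot per-cpu data must fit in one cache line");

/* An instance also has to start on a cache line boundary. */
static _Alignas(CACHELINE_BYTES_SKETCH) struct hot_percpu_sketch hot_sketch;
```

The compile-time size check turns an accidental layout change into a
build failure instead of a performance fluctuation.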

Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/20220915111145.179707194@infradead.org

Merge tag 'x86_core_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2022-05-24T01:42:07+00:00

Pull core x86 updates from Borislav Petkov:

 - Remove all the code around GS switching on 32-bit now that it is not
   needed anymore

 - Other misc improvements

* tag 'x86_core_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  bug: Use normal relative pointers in 'struct bug_entry'
  x86/nmi: Make register_nmi_handler() more robust
  x86/asm: Merge load_gs_index()
  x86/32: Remove lazy GS macros
  ELF: Remove elf_core_copy_kernel_regs()
  x86/32: Simplify ELF_CORE_COPY_REGS

x86/prctl: Remove pointless task argument

2022-05-13T10:56:28+00:00

The functions invoked via do_arch_prctl_common() can only operate on
the current task and none of these functions uses the task argument.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov 
Link: https://lore.kernel.org/r/87lev7vtxj.ffs@tglx

x86/32: Remove lazy GS macros

2022-04-14T12:09:43+00:00

GS is always a user segment now.

Signed-off-by: Brian Gerst 
Signed-off-by: Borislav Petkov 
Reviewed-by: Thomas Gleixner 
Acked-by: Andy Lutomirski 
Link: https://lore.kernel.org/r/20220325153953.162643-4-brgerst@gmail.com

x86/fpu: Move context switch and exit to user inlines into sched.h

2021-10-20T13:27:27+00:00

internal.h is a kitchen sink which needs to get out of the way to prepare
for the upcoming changes.

Move the context switch and exit to user inlines into a separate header,
which is all that code needs.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20211015011539.349132461@linutronix.de

x86/fpu: Remove pointless argument from switch_fpu_finish()

2021-10-20T13:27:25+00:00

Unused since the FPU switching rework.

Signed-off-by: Thomas Gleixner 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20211015011538.433135710@linutronix.de

x86/dumpstack: Add log_lvl to __show_regs()

2020-07-22T21:56:53+00:00

show_trace_log_lvl() provides an x86 platform-specific way to unwind
a backtrace with a given log level. Unfortunately, register dumps are
not printed with the same log level - instead, KERN_DEFAULT is always
used.

Arista's switches use a quite common setup with rsyslog, where only
urgent messages go to the console (console_log_level=KERN_ERR) and
everything else goes into /var/log/, as the console baud-rate often is
indecently slow (9600 bps).

Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where the register dump shouldn't appear with
a higher log level).

Add a log_lvl parameter to __show_regs().
Keep the log level that is used unchanged for now, so that the visible
change is kept separate.
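
The idea of threading the level through as a prefix string can be
sketched in userspace as follows.  'struct regs_sketch' and
'show_regs_sketch()' are hypothetical, snprintf() stands in for printk(),
and the level is modelled as an ordinary string prefix.

```c
#include <stdio.h>

/* Hypothetical register snapshot; the real struct pt_regs is larger. */
struct regs_sketch {
	unsigned long ip;
	unsigned long sp;
};

/*
 * Formats one register line into buf, prefixed with the caller's log
 * level string.  Returns whatever snprintf() returns.
 */
static int show_regs_sketch(char *buf, size_t len,
			    const struct regs_sketch *regs,
			    const char *log_lvl)
{
	return snprintf(buf, len, "%sIP: %08lx SP: %08lx\n",
			log_lvl, regs->ip, regs->sp);
}
```

Because the level is an argument, a caller that unwinds at one level can
hand the same level to the register dump.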

[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/

Signed-off-by: Dmitry Safonov 
Signed-off-by: Thomas Gleixner 
Acked-by: Petr Mladek 
Link: https://lkml.kernel.org/r/20200629144847.492794-3-dima@arista.com

mm: don't include asm/pgtable.h if linux/mm.h is already included

2020-06-09T16:39:13+00:00

Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definitions of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always the
possibility to override the generic version with the usual ifdef magic.
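
One common form of such an override guard is sketched below.  It is an
assumption about the arrangement, not text quoted from the series, and
the PMD_SHIFT and PTRS_PER_PMD values are illustrative only.

```c
/* Illustrative values only; each architecture supplies its own. */
#define PMD_SHIFT	21
#define PTRS_PER_PMD	512

/*
 * An architecture that needs a custom version would define pmd_index()
 * and "#define pmd_index pmd_index" before this point, which makes the
 * generic definition below drop out.
 */
#ifndef pmd_index
static inline unsigned long pmd_index(unsigned long address)
{
	return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
#define pmd_index pmd_index
#endif
```

Defining a macro with the same name as the function is what lets the
preprocessor test whether a definition has already been provided.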

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are removed with a simple loop:

	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done

Signed-off-by: Mike Rapoport 
Signed-off-by: Andrew Morton 
Cc: Arnd Bergmann 
Cc: Borislav Petkov 
Cc: Brian Cain 
Cc: Catalin Marinas 
Cc: Chris Zankel 
Cc: "David S. Miller" 
Cc: Geert Uytterhoeven 
Cc: Greentime Hu 
Cc: Greg Ungerer 
Cc: Guan Xuetao 
Cc: Guo Ren 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: Ingo Molnar 
Cc: Ley Foon Tan 
Cc: Mark Salter 
Cc: Matthew Wilcox 
Cc: Matt Turner 
Cc: Max Filippov 
Cc: Michael Ellerman 
Cc: Michal Simek 
Cc: Mike Rapoport 
Cc: Nick Hu 
Cc: Paul Walmsley 
Cc: Richard Weinberger 
Cc: Rich Felker 
Cc: Russell King 
Cc: Stafford Horne 
Cc: Thomas Bogendoerfer 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Vincent Chen 
Cc: Vineet Gupta 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds