<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/arch/x86/kernel/process_32.c, branch v6.9</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>x86/fpu: Clean up FPU switching in the middle of task switching</title>
<updated>2023-10-20T09:24:22+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-10-18T18:41:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=24b8a23638cbf92449c353f828b1d309548c78f4'/>
<id>24b8a23638cbf92449c353f828b1d309548c78f4</id>
<content type='text'>
It happens to work, but it's very very wrong, because our 'current'
macro is magic that is supposedly loading a stable value.

It just happens to be not quite stable enough and the compilers
re-load the value enough for this code to work.  But it's wrong.

The whole

        struct fpu *prev_fpu = &amp;prev-&gt;fpu;

thing in __switch_to() is pretty ugly. There's no reason why we
should look at that 'prev_fpu' pointer there, or pass it down.

And it only generates worse code, in how it loads 'current' when
__switch_to() has the right task pointers.

The attached patch not only cleans this up, it actually
generates better code too:

 (a) it removes one push/pop pair at entry/exit because there's one
     less register used (no 'current')

 (b) it removes that pointless load of 'current' because it just uses
     the right argument:

	-       movq    %gs:pcpu_hot(%rip), %r12
	-       testq   $16384, (%r12)
	+       testq   $16384, (%rdi)

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20231018184227.446318-1-ubizjak@gmail.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
It happens to work, but it's very very wrong, because our 'current'
macro is magic that is supposedly loading a stable value.

It just happens to be not quite stable enough and the compilers
re-load the value enough for this code to work.  But it's wrong.

The whole

        struct fpu *prev_fpu = &amp;prev-&gt;fpu;

thing in __switch_to() is pretty ugly. There's no reason why we
should look at that 'prev_fpu' pointer there, or pass it down.

And it only generates worse code, in how it loads 'current' when
__switch_to() has the right task pointers.

The attached patch not only cleans this up, it actually
generates better code too:

 (a) it removes one push/pop pair at entry/exit because there's one
     less register used (no 'current')

 (b) it removes that pointless load of 'current' because it just uses
     the right argument:

	-       movq    %gs:pcpu_hot(%rip), %r12
	-       testq   $16384, (%r12)
	+       testq   $16384, (%rdi)

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20231018184227.446318-1-ubizjak@gmail.com
</pre>
</div>
</content>
</entry>
<entry>
<title>x86/resctl: fix scheduler confusion with 'current'</title>
<updated>2023-03-08T19:48:11+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-03-07T21:06:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7fef099702527c3b2c5234a2ea6a24411485a13a'/>
<id>7fef099702527c3b2c5234a2ea6a24411485a13a</id>
<content type='text'>
The implementation of 'current' on x86 is very intentionally special: it
is a very common thing to look up, and it uses 'this_cpu_read_stable()'
to get the current thread pointer efficiently from per-cpu storage.

And the keyword in there is 'stable': the current thread pointer never
changes as far as a single thread is concerned.  Even if when a thread
is preempted, or moved to another CPU, or even across an explicit call
'schedule()' that thread will still have the same value for 'current'.

It is, after all, the kernel base pointer to thread-local storage.
That's why it's stable to begin with, but it's also why it's important
enough that we have that special 'this_cpu_read_stable()' access for it.

So this is all done very intentionally to allow the compiler to treat
'current' as a value that never visibly changes, so that the compiler
can do CSE and combine multiple different 'current' accesses into one.

However, there is obviously one very special situation when the
currently running thread does actually change: inside the scheduler
itself.

So the scheduler code paths are special, and do not have a 'current'
thread at all.  Instead there are _two_ threads: the previous and the
next thread - typically called 'prev' and 'next' (or prev_p/next_p)
internally.

So this is all actually quite straightforward and simple, and not all
that complicated.

Except for when you then have special code that is run in scheduler
context, that code then has to be aware that 'current' isn't really a
valid thing.  Did you mean 'prev'? Did you mean 'next'?

In fact, even if then look at the code, and you use 'current' after the
new value has been assigned to the percpu variable, we have explicitly
told the compiler that 'current' is magical and always stable.  So the
compiler is quite free to use an older (or newer) value of 'current',
and the actual assignment to the percpu storage is not relevant even if
it might look that way.

Which is exactly what happened in the resctl code, that blithely used
'current' in '__resctrl_sched_in()' when it really wanted the new
process state (as implied by the name: we're scheduling 'into' that new
resctl state).  And clang would end up just using the old thread pointer
value at least in some configurations.

This could have happened with gcc too, and purely depends on random
compiler details.  Clang just seems to have been more aggressive about
moving the read of the per-cpu current_task pointer around.

The fix is trivial: just make the resctl code adhere to the scheduler
rules of using the prev/next thread pointer explicitly, instead of using
'current' in a situation where it just wasn't valid.

That same code is then also used outside of the scheduler context (when
a thread resctl state is explicitly changed), and then we will just pass
in 'current' as that pointer, of course.  There is no ambiguity in that
case.

The fix may be trivial, but noticing and figuring out what went wrong
was not.  The credit for that goes to Stephane Eranian.

Reported-by: Stephane Eranian &lt;eranian@google.com&gt;
Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/
Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/
Reviewed-by: Nick Desaulniers &lt;ndesaulniers@google.com&gt;
Tested-by: Tony Luck &lt;tony.luck@intel.com&gt;
Tested-by: Stephane Eranian &lt;eranian@google.com&gt;
Tested-by: Babu Moger &lt;babu.moger@amd.com&gt;
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The implementation of 'current' on x86 is very intentionally special: it
is a very common thing to look up, and it uses 'this_cpu_read_stable()'
to get the current thread pointer efficiently from per-cpu storage.

And the keyword in there is 'stable': the current thread pointer never
changes as far as a single thread is concerned.  Even if when a thread
is preempted, or moved to another CPU, or even across an explicit call
'schedule()' that thread will still have the same value for 'current'.

It is, after all, the kernel base pointer to thread-local storage.
That's why it's stable to begin with, but it's also why it's important
enough that we have that special 'this_cpu_read_stable()' access for it.

So this is all done very intentionally to allow the compiler to treat
'current' as a value that never visibly changes, so that the compiler
can do CSE and combine multiple different 'current' accesses into one.

However, there is obviously one very special situation when the
currently running thread does actually change: inside the scheduler
itself.

So the scheduler code paths are special, and do not have a 'current'
thread at all.  Instead there are _two_ threads: the previous and the
next thread - typically called 'prev' and 'next' (or prev_p/next_p)
internally.

So this is all actually quite straightforward and simple, and not all
that complicated.

Except for when you then have special code that is run in scheduler
context, that code then has to be aware that 'current' isn't really a
valid thing.  Did you mean 'prev'? Did you mean 'next'?

In fact, even if then look at the code, and you use 'current' after the
new value has been assigned to the percpu variable, we have explicitly
told the compiler that 'current' is magical and always stable.  So the
compiler is quite free to use an older (or newer) value of 'current',
and the actual assignment to the percpu storage is not relevant even if
it might look that way.

Which is exactly what happened in the resctl code, that blithely used
'current' in '__resctrl_sched_in()' when it really wanted the new
process state (as implied by the name: we're scheduling 'into' that new
resctl state).  And clang would end up just using the old thread pointer
value at least in some configurations.

This could have happened with gcc too, and purely depends on random
compiler details.  Clang just seems to have been more aggressive about
moving the read of the per-cpu current_task pointer around.

The fix is trivial: just make the resctl code adhere to the scheduler
rules of using the prev/next thread pointer explicitly, instead of using
'current' in a situation where it just wasn't valid.

That same code is then also used outside of the scheduler context (when
a thread resctl state is explicitly changed), and then we will just pass
in 'current' as that pointer, of course.  There is no ambiguity in that
case.

The fix may be trivial, but noticing and figuring out what went wrong
was not.  The credit for that goes to Stephane Eranian.

Reported-by: Stephane Eranian &lt;eranian@google.com&gt;
Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/
Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/
Reviewed-by: Nick Desaulniers &lt;ndesaulniers@google.com&gt;
Tested-by: Tony Luck &lt;tony.luck@intel.com&gt;
Tested-by: Stephane Eranian &lt;eranian@google.com&gt;
Tested-by: Babu Moger &lt;babu.moger@amd.com&gt;
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>x86/percpu: Move current_top_of_stack next to current_task</title>
<updated>2022-10-17T14:41:05+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2022-09-15T11:11:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c063a217bc0726c2560138229de5673dbb253a02'/>
<id>c063a217bc0726c2560138229de5673dbb253a02</id>
<content type='text'>
Extend the struct pcpu_hot cacheline with current_top_of_stack;
another very frequently used value.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Extend the struct pcpu_hot cacheline with current_top_of_stack;
another very frequently used value.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org
</pre>
</div>
</content>
</entry>
<entry>
<title>x86: Put hot per CPU variables into a struct</title>
<updated>2022-10-17T14:41:03+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2022-09-15T11:11:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e57ef2ed97c1d078973298658a8096644a1e9e09'/>
<id>e57ef2ed97c1d078973298658a8096644a1e9e09</id>
<content type='text'>
The layout of per-cpu variables is at the mercy of the compiler. This
can lead to random performance fluctuations from build to build.

Create a structure to hold some of the hottest per-cpu variables,
starting with current_task.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20220915111145.179707194@infradead.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The layout of per-cpu variables is at the mercy of the compiler. This
can lead to random performance fluctuations from build to build.

Create a structure to hold some of the hottest per-cpu variables,
starting with current_task.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/20220915111145.179707194@infradead.org
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'x86_core_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2022-05-24T01:42:07+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-05-24T01:42:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=de8ac81747fca15925f3488ead7804560cdea532'/>
<id>de8ac81747fca15925f3488ead7804560cdea532</id>
<content type='text'>
Pull core x86 updates from Borislav Petkov:

 - Remove all the code around GS switching on 32-bit now that it is not
   needed anymore

 - Other misc improvements

* tag 'x86_core_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  bug: Use normal relative pointers in 'struct bug_entry'
  x86/nmi: Make register_nmi_handler() more robust
  x86/asm: Merge load_gs_index()
  x86/32: Remove lazy GS macros
  ELF: Remove elf_core_copy_kernel_regs()
  x86/32: Simplify ELF_CORE_COPY_REGS
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull core x86 updates from Borislav Petkov:

 - Remove all the code around GS switching on 32-bit now that it is not
   needed anymore

 - Other misc improvements

* tag 'x86_core_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  bug: Use normal relative pointers in 'struct bug_entry'
  x86/nmi: Make register_nmi_handler() more robust
  x86/asm: Merge load_gs_index()
  x86/32: Remove lazy GS macros
  ELF: Remove elf_core_copy_kernel_regs()
  x86/32: Simplify ELF_CORE_COPY_REGS
</pre>
</div>
</content>
</entry>
<entry>
<title>x86/prctl: Remove pointless task argument</title>
<updated>2022-05-13T10:56:28+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2022-05-12T12:04:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f5c0b4f30416c670408a77be94703d04d22b57df'/>
<id>f5c0b4f30416c670408a77be94703d04d22b57df</id>
<content type='text'>
The functions invoked via do_arch_prctl_common() can only operate on
the current task and none of these function uses the task argument.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Link: https://lore.kernel.org/r/87lev7vtxj.ffs@tglx
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The functions invoked via do_arch_prctl_common() can only operate on
the current task and none of these function uses the task argument.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Link: https://lore.kernel.org/r/87lev7vtxj.ffs@tglx
</pre>
</div>
</content>
</entry>
<entry>
<title>x86/32: Remove lazy GS macros</title>
<updated>2022-04-14T12:09:43+00:00</updated>
<author>
<name>Brian Gerst</name>
<email>brgerst@gmail.com</email>
</author>
<published>2022-03-25T15:39:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3a24a60854d2ef19e0edcd11cdbbb4fabc655dde'/>
<id>3a24a60854d2ef19e0edcd11cdbbb4fabc655dde</id>
<content type='text'>
GS is always a user segment now.

Signed-off-by: Brian Gerst &lt;brgerst@gmail.com&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Link: https://lore.kernel.org/r/20220325153953.162643-4-brgerst@gmail.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
GS is always a user segment now.

Signed-off-by: Brian Gerst &lt;brgerst@gmail.com&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Andy Lutomirski &lt;luto@kernel.org&gt;
Link: https://lore.kernel.org/r/20220325153953.162643-4-brgerst@gmail.com
</pre>
</div>
</content>
</entry>
<entry>
<title>x86/fpu: Move context switch and exit to user inlines into sched.h</title>
<updated>2021-10-20T13:27:27+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2021-10-15T01:16:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=63e81807c1f94e91b9d71c536112a40cd74bab85'/>
<id>63e81807c1f94e91b9d71c536112a40cd74bab85</id>
<content type='text'>
internal.h is a kitchen sink which needs to get out of the way to prepare
for the upcoming changes.

Move the context switch and exit to user inlines into a separate header,
which is all that code needs.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Link: https://lkml.kernel.org/r/20211015011539.349132461@linutronix.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
internal.h is a kitchen sink which needs to get out of the way to prepare
for the upcoming changes.

Move the context switch and exit to user inlines into a separate header,
which is all that code needs.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Link: https://lkml.kernel.org/r/20211015011539.349132461@linutronix.de
</pre>
</div>
</content>
</entry>
<entry>
<title>x86/fpu: Remove pointless argument from switch_fpu_finish()</title>
<updated>2021-10-20T13:27:25+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2021-10-15T01:15:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9568bfb4f04bd9a280c592879ccd7a26a77c1390'/>
<id>9568bfb4f04bd9a280c592879ccd7a26a77c1390</id>
<content type='text'>
Unused since the FPU switching rework.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Link: https://lkml.kernel.org/r/20211015011538.433135710@linutronix.de
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Unused since the FPU switching rework.

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Borislav Petkov &lt;bp@suse.de&gt;
Link: https://lkml.kernel.org/r/20211015011538.433135710@linutronix.de
</pre>
</div>
</content>
</entry>
<entry>
<title>x86/dumpstack: Add log_lvl to __show_regs()</title>
<updated>2020-07-22T21:56:53+00:00</updated>
<author>
<name>Dmitry Safonov</name>
<email>dima@arista.com</email>
</author>
<published>2020-06-29T14:48:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=44e215352cf17333992d56941b5bf4af60a67609'/>
<id>44e215352cf17333992d56941b5bf4af60a67609</id>
<content type='text'>
show_trace_log_lvl() provides x86 platform-specific way to unwind
backtrace with a given log level. Unfortunately, registers dump(s) are
not printed with the same log level - instead, KERN_DEFAULT is always
used.

Arista's switches uses quite common setup with rsyslog, where only
urgent messages goes to console (console_log_level=KERN_ERR), everything
else goes into /var/log/ as the console baud-rate often is indecently
slow (9600 bps).

Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where registers dump shouldn't appear with
higher log level).

Add log_lvl parameter to __show_regs().
Keep the used log level intact to separate visible change.

[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/

Signed-off-by: Dmitry Safonov &lt;dima@arista.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Petr Mladek &lt;pmladek@suse.com&gt;
Link: https://lkml.kernel.org/r/20200629144847.492794-3-dima@arista.com

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
show_trace_log_lvl() provides x86 platform-specific way to unwind
backtrace with a given log level. Unfortunately, registers dump(s) are
not printed with the same log level - instead, KERN_DEFAULT is always
used.

Arista's switches uses quite common setup with rsyslog, where only
urgent messages goes to console (console_log_level=KERN_ERR), everything
else goes into /var/log/ as the console baud-rate often is indecently
slow (9600 bps).

Backtrace dumps without registers printed have proven to be as useful as
morning standups. Furthermore, in order to introduce KERN_UNSUPPRESSED
(which I believe is still the most elegant way to fix raciness of sysrq[1])
the log level should be passed down the stack to register dumping
functions. Besides, there is a potential use-case for printing traces
with KERN_DEBUG level [2] (where registers dump shouldn't appear with
higher log level).

Add log_lvl parameter to __show_regs().
Keep the used log level intact to separate visible change.

[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/
[2]: https://lore.kernel.org/linux-doc/20190724170249.9644-1-dima@arista.com/

Signed-off-by: Dmitry Safonov &lt;dima@arista.com&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Petr Mladek &lt;pmladek@suse.com&gt;
Link: https://lkml.kernel.org/r/20200629144847.492794-3-dima@arista.com

</pre>
</div>
</content>
</entry>
</feed>
