linux-stable.git/drivers/idle, branch linux-6.10.y

cpuidle: ACPI/intel: fix MWAIT hint target C-state computation

2024-03-05T20:25:18+00:00

According to x86 spec ([1] and [2]), MWAIT hint_address[7:4] plus 1 is
the corresponding C-state, and 0xF means C0.

ACPI C-state table usually only contains C1+, but nothing prevents ACPI
firmware from presenting a C-state (maybe C1+) but using MWAIT address C0
(i.e., 0xF in ACPI FFH MWAIT hint address). And if this is the case, Linux
erroneously treat this cstate as C16, while actually this should be valid
C0 instead of C16, as per the specifications.

Since ACPI firmware is out of Linux kernel scope, fix the kernel handling
of 0xF ->(to) C0 in this situation. This is found when a tweaked ACPI
C-state table is presented by Qemu to VM.

Also modify the intel_idle case for code consistency.

[1]. Intel SDM Vol 2, Table 4-11. MWAIT Hints
Register (EAX): "Value of 0 means C1; 1 means C2 and so on
Value of 01111B means C0".

[2]. AMD manual Vol 3, MWAIT: "The processor C-state is EAX[7:4]+1, so to
request C0 is to place the value F in EAX[7:4] and to request C1 is to
place the value 0 in EAX[7:4].".

Signed-off-by: He Rongguang 
[ rjw: Subject and changelog edits, whitespace fixups ]
Signed-off-by: Rafael J. Wysocki

Merge tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

2024-01-10T00:32:11+00:00

Pull power management updates from Rafael Wysocki:
 "These add support for new processors (Sierra Forest, Grand Ridge and
  Meteor Lake) to the intel_idle driver, make intel_pstate run on
  Emerald Rapids without HWP support and adjust it to utilize EPP values
  supplied by the platform firmware, fix issues, clean up code and
  improve documentation.

  The most significant fix addresses deadlocks in the core system-wide
  resume code that occur if async_schedule_dev() attempts to run its
  argument function synchronously (for example, due to a memory
  allocation failure). It rearranges the code in question which may
  increase the system resume time in some cases, but this basically is a
  removal of a premature optimization. That optimization will be added
  back later, but properly this time.

  Specifics:

   - Add support for the Sierra Forest, Grand Ridge and Meteorlake SoCs
     to the intel_idle cpuidle driver (Artem Bityutskiy, Zhang Rui)

   - Do not enable interrupts when entering idle in the haltpoll cpuidle
     driver (Borislav Petkov)

   - Add Emerald Rapids support in no-HWP mode to the intel_pstate
     cpufreq driver (Zhenguo Yao)

   - Use EPP values programmed by the platform firmware as balanced
     performance ones by default in intel_pstate (Srinivas Pandruvada)

   - Add a missing function return value check to the SCMI cpufreq
     driver to avoid unexpected behavior (Alexandra Diupina)

   - Fix parameter type warning in the armada-8k cpufreq driver (Gregory
     CLEMENT)

   - Rework trans_stat_show() in the devfreq core code to avoid buffer
     overflows (Christian Marangi)

   - Synchronize devfreq_monitor_[start/stop] so as to prevent a timer
     list corruption from occurring when devfreq governors are switched
     frequently (Mukesh Ojha)

   - Fix possible deadlocks in the core system-wide PM code that occur
     if device-handling functions cannot be executed asynchronously
     during resume from system-wide suspend (Rafael J. Wysocki)

   - Clean up unnecessary local variable initializations in multiple
     places in the hibernation code (Wang chaodong, Li zeming)

   - Adjust core hibernation code to avoid missing wakeup events that
     occur after saving an image to persistent storage (Chris Feng)

   - Update hibernation code to enforce correct ordering during image
     compression and decompression (Hongchen Zhang)

   - Use kmap_local_page() instead of kmap_atomic() in copy_data_page()
     during hibernation and restore (Chen Haonan)

   - Adjust documentation and code comments to reflect recent tasks
     freezer changes (Kevin Hao)

   - Repair excess function parameter description warning in the
     hibernation image-saving code (Randy Dunlap)

   - Fix _set_required_opps when opp is NULL (Bryan O'Donoghue)

   - Use device_get_match_data() in the OPP code for TI (Rob Herring)

   - Clean up OPP level and other parts and call dev_pm_opp_set_opp()
     recursively for required OPPs (Viresh Kumar)"

* tag 'pm-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
  OPP: Rename 'rate_clk_single'
  OPP: Pass rounded rate to _set_opp()
  OPP: Relocate dev_pm_opp_sync_regulators()
  PM: sleep: Fix possible deadlocks in core system-wide PM code
  OPP: Move dev_pm_opp_icc_bw to internal opp.h
  async: Introduce async_schedule_dev_nocall()
  async: Split async_schedule_node_domain()
  cpuidle: haltpoll: Do not enable interrupts when entering idle
  OPP: Fix _set_required_opps when opp is NULL
  OPP: The level field is always of unsigned int type
  PM: hibernate: Repair excess function parameter description warning
  PM: sleep: Remove obsolete comment from unlock_system_sleep()
  cpufreq: intel_pstate: Add Emerald Rapids support in no-HWP mode
  Documentation: PM: Adjust freezing-of-tasks.rst to the freezer changes
  PM: hibernate: Use kmap_local_page() in copy_data_page()
  intel_idle: add Sierra Forest SoC support
  intel_idle: add Grand Ridge SoC support
  PM / devfreq: Synchronize devfreq_monitor_[start/stop]
  cpufreq: armada-8k: Fix parameter type warning
  PM: hibernate: Enforce ordering during image compression/decompression
  ...

intel_idle: add Sierra Forest SoC support

2023-12-19T17:56:45+00:00

Add Sierra Forest SoC C-states, which are C1, C1E, C6S, and C6SP.

Sierra Forest SoC is built with modules, each module includes 4 cores
(Crestmont microarchitecture). There is one L2 cache per module, shared
between the 4 cores.

There is no core C6 state, but there is C6S state, which has module scope:
when all 4 cores request C6S, the entire module (4 cores + L2 cache)
enters the low power state.

C6SP state has package scope - when all modules in the package enter C6S,
the package enters the power state mode.

Signed-off-by: Artem Bityutskiy 
Signed-off-by: Rafael J. Wysocki

intel_idle: add Grand Ridge SoC support

2023-12-19T17:56:45+00:00

Add Intel Grand Ridge SoC C-states, which are C1, C1E, and C6S.

The Grand Ridge SoC is built with modules, each module includes 4 cores
(Crestmont microarchitecture). There is one L2 cache per module, shared
between the 4 cores.

There is no core C6 state, but there is C6S state, which has module
scope: when all 4 cores request C6S, the entire module (4 cores + L2
cache) enters the low power state.

Package C6 is not supported by Grand Ridge SoC.

Signed-off-by: Artem Bityutskiy 
Signed-off-by: Rafael J. Wysocki

intel_idle: Add Meteorlake support

2023-12-11T19:58:27+00:00

Add intel_idle support for MeteorLake.

C1 and C1E states on Meteorlake are mutually exclusive, like Alderlake
and Raptorlake, but they have little latency difference with measureable
power difference, so always enable "C1E promotion" bit and expose C1E
only.

Expose C6 because it has less power compared with C1E, and smaller
latency compared with C8/C10.

Ignore C8 and expose C10, because C8 does not show latency advantage
compared with C10.

Signed-off-by: Zhang Rui 
Signed-off-by: Rafael J. Wysocki

x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram

2023-11-29T14:44:01+00:00

intel_idle_irq() re-enables IRQs very early. As a result, an interrupt
may fire before mwait() is eventually called. If such an interrupt queues
a timer, it may go unnoticed until mwait returns and the idle loop
handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't
help because a local timer enqueue doesn't set that flag.

The issue is mitigated by the fact that this idle handler is only invoked
for shallow C-states when, presumably, the next tick is supposed to be
close enough. There may still be rare cases though when the next tick
is far away and the selected C-state is shallow, resulting in a timer
getting ignored for a while.

Fix this with using sti_mwait() whose IRQ-reenablement only triggers
upon calling mwait(), dealing with the race while keeping the interrupt
latency within acceptable bounds.

Fixes: c227233ad64c (intel_idle: enable interrupts before C1 on Xeons)
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Link: https://lkml.kernel.org/r/20231115151325.6262-3-frederic@kernel.org

intel_idle: Add ibrs_off module parameter to force-disable IBRS

2023-10-07T09:33:28+00:00

Commit bf5835bcdb96 ("intel_idle: Disable IBRS during long idle")
disables IBRS when the cstate is 6 or lower. However, there are
some use cases where a customer may want to use max_cstate=1 to
lower latency. Such use cases will suffer from the performance
degradation caused by the enabling of IBRS in the sibling idle thread.
Add a "ibrs_off" module parameter to force disable IBRS and the
CPUIDLE_FLAG_IRQ_ENABLE flag if set.

In the case of a Skylake server with max_cstate=1, this new ibrs_off
option will likely increase the IRQ response latency as IRQ will now
be disabled.

When running SPECjbb2015 with cstates set to C1 on a Skylake system.

First test when the kernel is booted with: "intel_idle.ibrs_off":

  max-jOPS = 117828, critical-jOPS = 66047

Then retest when the kernel is booted without the "intel_idle.ibrs_off"
added:

  max-jOPS = 116408, critical-jOPS = 58958

That means booting with "intel_idle.ibrs_off" improves performance by:

  max-jOPS:      +1.2%, which could be considered noise range.
  critical-jOPS: +12%,  which is definitely a solid improvement.

The admin-guide/pm/intel_idle.rst file is updated to add a description
about the new "ibrs_off" module parameter.

Signed-off-by: Waiman Long 
Signed-off-by: Ingo Molnar 
Acked-by: Rafael J. Wysocki 
Cc: Linus Torvalds 
Link: https://lore.kernel.org/r/20230727184600.26768-5-longman@redhat.com

intel_idle: Use __update_spec_ctrl() in intel_idle_ibrs()

2023-10-07T09:33:28+00:00

When intel_idle_ibrs() is called, it modifies the SPEC_CTRL MSR to 0
in order disable IBRS. However, the new MSR value isn't reflected in
x86_spec_ctrl_current which is at odd with the other code that keep track
of its state in that percpu variable.  Use the new __update_spec_ctrl()
to have the x86_spec_ctrl_current percpu value properly updated.

Since spec-ctrl.h includes both msr.h and nospec-branch.h, we can remove
those from the include file list.

Signed-off-by: Waiman Long 
Signed-off-by: Ingo Molnar 
Acked-by: Rafael J. Wysocki 
Cc: Linus Torvalds 
Link: https://lore.kernel.org/r/20230727184600.26768-4-longman@redhat.com

Merge tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2023-08-28T23:35:01+00:00

Pull perf event updates from Ingo Molnar:

 - AMD IBS improvements

 - Intel PMU driver updates

 - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events

 - Micro-optimize software events and the ring-buffer code

 - Misc cleanups & fixes

* tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/uncore: Remove unnecessary ?: operator around pcibios_err_to_errno() call
  perf/x86/intel: Add Crestmont PMU
  x86/cpu: Update Hybrids
  x86/cpu: Fix Crestmont uarch
  x86/cpu: Fix Gracemont uarch
  perf: Remove unused extern declaration arch_perf_get_page_size()
  perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  arm_pmu: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  perf/x86: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability
  perf/x86/ibs: Set mem_lvl_num, mem_remote and mem_hops for data_src
  perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA
  perf/mem: Introduce PERF_MEM_LVLNUM_UNC
  perf/ring_buffer: Use local_try_cmpxchg in __perf_output_begin
  locking/arch: Avoid variable shadowing in local_try_cmpxchg()
  perf/core: Use local64_try_cmpxchg in perf_swevent_set_period
  perf/x86: Use local64_try_cmpxchg
  perf/amd: Prevent grouping of IBS events

x86/cpu: Fix Gracemont uarch

2023-08-09T19:51:06+00:00

Alderlake N is an E-core only product using Gracemont
micro-architecture. It fits the pre-existing naming scheme perfectly
fine, adhere to it.

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rafael J. Wysocki 
Acked-by: Hans de Goede 
Link: https://lore.kernel.org/r/20230807150405.686834933@infradead.org