| Age | Commit message (Collapse) | Author |
|
Panther cove µarch starts to support auto counter reload (ACR), but the
static_call intel_pmu_enable_acr_event() is not updated for the Panther
Cove µarch used by DMR. It leads to the auto counter reload is not
really enabled on DMR.
Update static_call intel_pmu_enable_acr_event() in intel_pmu_init_pnc().
Fixes: d345b6bb8860 ("perf/x86/intel: Add core PMU support for DMR")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430002558.712334-5-dapeng1.mi@linux.intel.com
|
|
On platforms with Auto Counter Reload (ACR) support, such as NVL, a
"NMI received for unknown reason 30" warning is observed when running
multiple events in a group with ACR enabled:
$ perf record -e '{instructions/period=20000,acr_mask=0x2/u,\
cycles/period=40000,acr_mask=0x3/u}' ./test
The warning occurs because the Performance Monitoring Interrupt (PMI)
is enabled for the self-reloaded event (the cycles event in this case).
According to the Intel SDM, the overflow bit
(IA32_PERF_GLOBAL_STATUS.PMCn_OVF) is never set for self-reloaded events.
Since the bit is not set, the perf NMI handler cannot identify the source
of the interrupt, leading to the "unknown reason" message.
Furthermore, enabling PMI for self-reloaded events is unnecessary and
can lead to extraneous records that pollute the user's requested data.
Disable the interrupt bit for all events configured with ACR self-reload.
Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload")
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430002558.712334-4-dapeng1.mi@linux.intel.com
|
|
Members of an ACR group are logically linked via a bitmask of their
hardware counter indices. If some members of the group are assigned new
hardware counters during rescheduling, even events that keep their
original counter index must be updated with a new mask.
Without this, an event will continue to use a stale acr_mask that
references the old indices of its group peers. Ensure all ACR events are
reprogrammed during the scheduling path to maintain consistency across
the group.
Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430002558.712334-3-dapeng1.mi@linux.intel.com
|
|
Currently there are several issues on the user space ACR mask validation
and configuration.
- The validation for user space ACR mask (attr.config2) is incomplete,
e.g., the ACR mask could include the index which belongs to another
ACR events group, but it's not validated.
- An early return on an invalid ACR mask caused all subsequent ACR groups
to be skipped.
- The stale hardware ACR mask (hw.config1) is not cleared before setting
new hardware ACR mask.
The following changes address all of the above issues.
- Figure out the event index group of an ACR group. Any bits in the
user-space mask not present in the index group are now dropped.
- Instead of an early return on invalid bits, drop only the invalid
portions and continue iterating through all ACR events to ensure full
configuration.
- Explicitly clear the stale hardware ACR mask for each event prior to
writing the new configuration.
Besides, a non-leader event member of ACR group could be disabled in
theory. This could cause bit-shifting errors in the acr_mask of remaining
group members. But since ACR sampling requires all events to be active,
this should not be a big concern in real use case. Add a "FIXME" comment
to notice this risk.
Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260430002558.712334-2-dapeng1.mi@linux.intel.com
|
|
Pull kvm updates from Paolo Bonzini:
"Arm:
- Add support for tracing in the standalone EL2 hypervisor code,
which should help both debugging and performance analysis. This
uses the new infrastructure for 'remote' trace buffers that can be
exposed by non-kernel entities such as firmware, and which came
through the tracing tree
- Add support for GICv5 Per Processor Interrupts (PPIs), as the
starting point for supporting the new GIC architecture in KVM
- Finally add support for pKVM protected guests, where pages are
unmapped from the host as they are faulted into the guest and can
be shared back from the guest using pKVM hypercalls. Protected
guests are created using a new machine type identifier. As the
elusive guestmem has not yet delivered on its promises, anonymous
memory is also supported
This is only a first step towards full isolation from the host; for
example, the CPU register state and DMA accesses are not yet
isolated. Because this does not really yet bring fully what it
promises, it is hidden behind CONFIG_ARM_PKVM_GUEST +
'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is
created. Caveat emptor
- Rework the dreaded user_mem_abort() function to make it more
maintainable, reducing the amount of state being exposed to the
various helpers and rendering a substantial amount of state
immutable
- Expand the Stage-2 page table dumper to support NV shadow page
tables on a per-VM basis
- Tidy up the pKVM PSCI proxy code to be slightly less hard to
follow
- Fix both SPE and TRBE in non-VHE configurations so that they do not
generate spurious, out of context table walks that ultimately lead
to very bad HW lockups
- A small set of patches fixing the Stage-2 MMU freeing in error
cases
- Tighten-up accepted SMC immediate value to be only #0 for host
SMCCC calls
- The usual cleanups and other selftest churn
LoongArch:
- Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel()
- Add DMSINTC irqchip in kernel support
RISC-V:
- Fix steal time shared memory alignment checks
- Fix vector context allocation leak
- Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi()
- Fix double-free of sdata in kvm_pmu_clear_snapshot_area()
- Fix integer overflow in kvm_pmu_validate_counter_mask()
- Fix shift-out-of-bounds in make_xfence_request()
- Fix lost write protection on huge pages during dirty logging
- Split huge pages during fault handling for dirty logging
- Skip CSR restore if VCPU is reloaded on the same core
- Implement kvm_arch_has_default_irqchip() for KVM selftests
- Factored-out ISA checks into separate sources
- Added hideleg to struct kvm_vcpu_config
- Factored-out VCPU config into separate sources
- Support configuration of per-VM HGATP mode from KVM user space
s390:
- Support for ESA (31-bit) guests inside nested hypervisors
- Remove restriction on memslot alignment, which is not needed
anymore with the new gmap code
- Fix LPSW/E to update the bear (which of course is the breaking
event address register)
x86:
- Shut up various UBSAN warnings on reading module parameter before
they were initialized
- Don't zero-allocate page tables that are used for splitting
hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in
the page table and thus write all bytes
- As an optimization, bail early when trying to unsync 4KiB mappings
if the target gfn can just be mapped with a 2MiB hugepage
x86 generic:
- Copy single-chunk MMIO write values into struct kvm_vcpu (more
precisely struct kvm_mmio_fragment) to fix use-after-free stack
bugs where KVM would dereference stack pointer after an exit to
userspace
- Clean up and comment the emulated MMIO code to try to make it
easier to maintain (not necessarily "easy", but "easier")
- Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of
VMX and SVM enabling) as it is needed for trusted I/O
- Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions
- Immediately fail the build if a required #define is missing in one
of KVM's headers that is included multiple times
- Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected
exception, mostly to prevent syzkaller from abusing the uAPI to
trigger WARNs, but also because it can help prevent userspace from
unintentionally crashing the VM
- Exempt SMM from CPUID faulting on Intel, as per the spec
- Misc hardening and cleanup changes
x86 (AMD):
- Fix and optimize IRQ window inhibit handling for AVIC; make it
per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple
vCPUs have to-be-injected IRQs
- Clean up and optimize the OSVW handling, avoiding a bug in which
KVM would overwrite state when enabling virtualization on multiple
CPUs in parallel. This should not be a problem because OSVW should
usually be the same for all CPUs
- Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains
about a "too large" size based purely on user input
- Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION
- Disallow synchronizing a VMSA of an already-launched/encrypted
vCPU, as doing so for an SNP guest will crash the host due to an
RMP violation page fault
- Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped
queries are required to hold kvm->lock, and enforce it by lockdep.
Fix various bugs where sev_guest() was not ensured to be stable for
the whole duration of a function or ioctl
- Convert a pile of kvm->lock SEV code to guard()
- Play nicer with userspace that does not enable
KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6
as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the
payload would end up in EXITINFO2 rather than CR2, for example).
Only set CR2 and DR6 when consumption of the payload is imminent,
but on the other hand force delivery of the payload in all paths
where userspace retrieves CR2 or DR6
- Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT
instead of vmcb02->save.cr2. The value is out of sync after a
save/restore or after a #PF is injected into L2
- Fix a class of nSVM bugs where some fields written by the CPU are
not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so
are not up-to-date when saved by KVM_GET_NESTED_STATE
- Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE
and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly
initialized after save+restore
- Add a variety of missing nSVM consistency checks
- Fix several bugs where KVM failed to correctly update VMCB fields
on nested #VMEXIT
- Fix several bugs where KVM failed to correctly synthesize #UD or
#GP for SVM-related instructions
- Add support for save+restore of virtualized LBRs (on SVM)
- Refactor various helpers and macros to improve clarity and
(hopefully) make the code easier to maintain
- Aggressively sanitize fields when copying from vmcb12, to guard
against unintentionally allowing L1 to utilize yet-to-be-defined
features
- Fix several bugs where KVM botched rAX legality checks when
emulating SVM instructions. There are remaining issues in that KVM
doesn't handle size prefix overrides for 64-bit guests
- Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
instead of somewhat arbitrarily synthesizing #GP (i.e. don't double
down on AMD's architectural but sketchy behavior of generating #GP
for "unsupported" addresses)
- Cache all used vmcb12 fields to further harden against TOCTOU bugs
x86 (Intel):
- Drop obsolete branch hint prefixes from the VMX instruction macros
- Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a
register input when appropriate
- Code cleanups
guest_memfd:
- Don't mark guest_memfd folios as accessed, as guest_memfd doesn't
support reclaim, the memory is unevictable, and there is no storage
to write back to
LoongArch selftests:
- Add KVM PMU test cases
s390 selftests:
- Enable more memory selftests
x86 selftests:
- Add support for Hygon CPUs in KVM selftests
- Fix a bug in the MSR test where it would get false failures on
AMD/Hygon CPUs with exactly one of RDPID or RDTSCP
- Add an MADV_COLLAPSE testcase for guest_memfd as a regression test
for a bug where the kernel would attempt to collapse guest_memfd
folios against KVM's will"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits)
KVM: x86: use inlines instead of macros for is_sev_*guest
x86/virt: Treat SVM as unsupported when running as an SEV+ guest
KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails
KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper
KVM: SEV: use mutex guard in snp_handle_guest_req()
KVM: SEV: use mutex guard in sev_mem_enc_unregister_region()
KVM: SEV: use mutex guard in sev_mem_enc_ioctl()
KVM: SEV: use mutex guard in snp_launch_update()
KVM: SEV: Assert that kvm->lock is held when querying SEV+ support
KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe"
KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y
KVM: SEV: WARN on unhandled VM type when initializing VM
KVM: LoongArch: selftests: Add PMU overflow interrupt test
KVM: LoongArch: selftests: Add basic PMU event counting test
KVM: LoongArch: selftests: Add cpucfg read/write helpers
LoongArch: KVM: Add DMSINTC inject msi to vCPU
LoongArch: KVM: Add DMSINTC device support
LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
- Consolidate AMD and Hygon cases in parse_topology() (Wei Wang)
- asm constraints cleanups in __iowrite32_copy() (Uros Bizjak)
- Drop AMD Extended Interrupt LVT macros (Naveen N Rao)
- Don't use REALLY_SLOW_IO for delays (Juergen Gross)
- paravirt cleanups (Juergen Gross)
- FPU code cleanups (Borislav Petkov)
- split-lock handling code cleanups (Borislav Petkov, Ronan Pigott)
* tag 'x86-cleanups-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Correct the comment explaining what xfeatures_in_use() does
x86/split_lock: Don't warn about unknown split_lock_detect parameter
x86/fpu: Correct misspelled xfeaures_to_write local var
x86/apic: Drop AMD Extended Interrupt LVT macros
x86/cpu/topology: Consolidate AMD and Hygon cases in parse_topology()
block/floppy: Don't use REALLY_SLOW_IO for delays
x86/paravirt: Replace io_delay() hook with a bool
x86/irqflags: Preemptively move include paravirt.h directive where it belongs
x86/split_lock: Restructure the unwieldy switch-case in sld_state_show()
x86/local: Remove trailing semicolon from _ASM_XADD in local_add_return()
x86/asm: Use inout "+" asm onstraint modifiers in __iowrite32_copy()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Core updates:
- Try to allocate task_ctx_data quickly, to optimize O(N^2) algorithm
on large systems with O(100k) threads (Namhyung Kim)
AMD PMU driver IBS support updates and fixes, by Ravi Bangoria:
- Fix interrupt accounting for discarded samples
- Fix a Zen5-specific quirk
- Fix PhyAddrVal handling
- Fix NMI-safety with perf_allow_kernel()
- Fix a race between event add and NMIs
Intel PMU driver updates:
- Only check GP counters for PEBS constraints validation (Dapeng Mi)
MSR driver:
- Turn SMI_COUNT and PPERF on by default, instead of a long list of
CPU models to enable them on (Kan Liang)
... and misc cleanups and fixes by Aldf Conte, Anshuman Khandual,
Namhyung Kim, Ravi Bangoria and Yen-Hsiang Hsu"
* tag 'perf-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/events: Replace READ_ONCE() with standard pgtable accessors
perf/x86/msr: Make SMI and PPERF on by default
perf/x86/intel/p4: Fix unused variable warning in p4_pmu_init()
perf/x86/intel: Only check GP counters for PEBS constraints validation
perf/x86/amd/ibs: Fix comment typo in ibs_op_data
perf/amd/ibs: Advertise remote socket capability
perf/amd/ibs: Enable streaming store filter
perf/amd/ibs: Enable RIP bit63 hardware filtering
perf/amd/ibs: Enable fetch latency filtering
perf/amd/ibs: Support IBS_{FETCH|OP}_CTL2[Dis] to eliminate RMW race
perf/amd/ibs: Add new MSRs and CPUID bits definitions
perf/amd/ibs: Define macro for ldlat mask and shift
perf/amd/ibs: Avoid race between event add and NMI
perf/amd/ibs: Avoid calling perf_allow_kernel() from the IBS NMI handler
perf/amd/ibs: Preserve PhyAddrVal bit when clearing PhyAddr MSR
perf/amd/ibs: Limit ldlat->l3missonly dependency to Zen5
perf/amd/ibs: Account interrupt for discarded samples
perf/core: Simplify __detach_global_ctx_data()
perf/core: Try to allocate task_ctx_data quickly
perf/core: Pass GFP flags to attach_task_ctx_data()
|
|
KVM x86 VMXON and EFER.SVME extraction for 7.1
Move _only_ VMXON+VMXOFF and EFER.SVME toggling out of KVM (versus all of VMX
and SVM enabling) out of KVM and into the core kernel so that non-KVM TDX
enabling, e.g. for trusted I/O, can make SEAMCALLs without needing to ensure
KVM is fully loaded.
TIO isn't a hypervisor, and isn't trying to be a hypervisor. Specifically, TIO
should _never_ have it's own VMCSes (that are visible to the host; the
TDX-Module has it's own VMCSes to do SEAMCALL/SEAMRET), and so there is simply
no reason to move that functionality out of KVM.
With that out of the way, dealing with VMXON/VMXOFF and EFER.SVME is a fairly
simple refcounting game.
|
|
The third argument in INTEL_UNCORE_FR_EVENT_DESC() is subject to
__stringify(), and the extra double quote marks can result in the
expansion "3.814697266e-6" in the sysfs knobs, instead of
3.814697266e-6.
This is incorrect, though it may still work for perf, e.g.
perf stat -e uncore_iio_free_running_0/bw_in_port0/
Fixes: d8987048f665 ("perf/x86/intel/uncore: Support IIO free-running counters on DMR")
Closes: https://lore.kernel.org/all/20251231224233.113839-1-zide.chen@intel.com/
Reported-by: Chun-Tse Shao <ctshao@google.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chun-Tse Shao <ctshao@google.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260313174050.171704-5-zide.chen@intel.com
|
|
In snbep_pci2phy_map_init(), in the nr_node_ids > 8 path,
uncore_device_to_die() may return -1 when all CPUs associated
with the UBOX device are offline.
Remove the WARN_ON_ONCE(die_id == -1) check for two reasons:
- The current code breaks out of the loop. This is incorrect because
pci_get_device() does not guarantee iteration in domain or bus order,
so additional UBOX devices may be skipped during the scan.
- Returning -EINVAL is incorrect, since marking offline buses with
die_id == -1 is expected and should not be treated as an error.
Separately, when NUMA is disabled on a NUMA-capable platform,
pcibus_to_node() returns NUMA_NO_NODE, causing uncore_device_to_die()
to return -1 for all PCI devices. As a result,
spr_update_device_location(), used on Intel SPR and EMR, ignores the
corresponding PMON units and does not add them to the RB tree.
Fix this by using uncore_pcibus_to_dieid(), which retrieves topology
from the UBOX GIDNIDMAP register and works regardless of whether NUMA
is enabled in Linux. This requires snbep_pci2phy_map_init() to be
added in spr_uncore_pci_init().
Keep uncore_device_to_die() only for the nr_node_ids > 8 case, where
NUMA is expected to be enabled.
Fixes: 9a7832ce3d92 ("perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info")
Fixes: 65248a9a9ee1 ("perf/x86/uncore: Add a quirk for UPI on SPR")
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://patch.msgid.link/20260313174050.171704-4-zide.chen@intel.com
|
|
This warning can be triggered if NUMA is disabled and the system
boots with fewer CPUs than the number of CPUs in die 0.
WARNING: CPU: 9 PID: 7257 at uncore.c:1157 uncore_pci_pmu_register+0x136/0x160 [intel_uncore]
Currently, the discovery table continues to be parsed even if all CPUs
in the associated die are offline. This can lead to an array overflow
at "pmu->boxes[die] = box" in uncore_pci_pmu_register(), which may
trigger the warning above or cause other issues.
Fixes: edae1f06c2cd ("perf/x86/intel/uncore: Parse uncore discovery tables")
Reported-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://patch.msgid.link/20260313174050.171704-3-zide.chen@intel.com
|
|
Kernel test robot reported:
Unverified Error/Warning (likely false positive, kindly check if
interested):
arch/x86/events/intel/uncore_discovery.c:293:2-8:
ERROR: missing iounmap; ioremap on line 288 and execution via
conditional on line 292
If domain->global_init() fails in __parse_discovery_table(), the
ioremap'ed MMIO region is not released before returning, resulting
in an MMIO mapping leak.
Fixes: b575fc0e3357 ("perf/x86/intel/uncore: Add domain global init callback")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260313174050.171704-2-zide.chen@intel.com
|
|
AMD defines Extended Interrupt Local Vector Table (EILVT) registers to allow
for additional interrupt sources. While the APIC registers for those are
unique to AMD, the format of those registers follows the standard LVT
registers. Drop EILVT-specific macros in favor of the standard APIC
LVT macros.
Drop unused APIC_EILVT_NR_AMD_K8 and APIC_EILVT_LVTOFF while at it.
No functional change.
[ bp: Merge the two cleanup patches into one. ]
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Manali Shukla <manali.shukla@amd.com>
Link: https://patch.msgid.link/b98d69037c0102d2ccd082a941888a689cd214c9.1775019269.git.naveen@kernel.org
|
|
The MSRs, SMI_COUNT and PPERF, are model-specific MSRs. A very long
CPU ID list is maintained to indicate the supported platforms. With more
and more platforms being introduced, new CPU IDs have to be kept adding.
Also, the old kernel has to be updated to apply the new CPU ID.
The MSRs have been introduced for a long time. There is no plan to
change them in the near future. Furthermore, the current code utilizes
rdmsr_safe() to check the availability of MSRs before using it.
Make them on by default. It should be good enough to only rely on the
rdmsr_safe() to check their availability for both existing and future
platforms.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260327052844.818218-1-dapeng1.mi@linux.intel.com
|
|
Auto counter reload may have a group of events with software events
present within it. The software event PMU isn't the x86_hybrid_pmu and
a container_of operation in intel_pmu_set_acr_caused_constr (via the
hybrid helper) could cause out of bound memory reads. Avoid this by
guarding the call to intel_pmu_set_acr_caused_constr with an
is_x86_event check.
Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://patch.msgid.link/20260312194305.1834035-1-irogers@google.com
|
|
Build the kernel with make W=1 generates the following warning:
arch/x86/events/intel/p4.c: In function ‘p4_pmu_init’:
arch/x86/events/intel/p4.c:1370:27: error: variable ‘high’ set but not used [-Werror=unused-but-set-variable]
1370 | unsigned int low, high;
| ^~~~
This happens because, although both variables are declared and
initialized by rdmsr, only `low` is used in the subsequent if statement.
This patch uses the rdmsrq() macro instead of the rdmsr() macro.
The rdmsrq() macro avoids the use of high and low variables
because it reads the msr value in a single u64 variable.
Also, replace (1 << 7) with the proper macro.
Running `make W=1` again resolves the error.
I was unable to test the patch because
i do not have the hardware.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Aldo Conte <aldocontelk@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260320112302.281549-1-aldocontelk@gmail.com
|
|
It's good enough to only check GP counters for PEBS constraints
validation since constraints overlap can only happen on GP counters.
Besides opportunistically refine the code style and use pr_warn() to
replace pr_info() as the message itself is a warning message.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260228053320.140406-1-dapeng1.mi@linux.intel.com
|
|
When omr_source is 0x2, the omr_snoop (bit[6]) and omr_promoted (bit[7])
fields are combined to represent the snoop information. However, the
omr_promoted field was not left-shifted by 1 bit, resulting in incorrect
snoop information.
Besides, the snoop information parsing is not accurate for some OMR
sources, like the snoop information should be SNOOP_NONE for these memory
access (omr_source >= 7) instead of SNOOP_HIT.
Fix these issues.
Closes: https://lore.kernel.org/all/CAP-5=fW4zLWFw1v38zCzB9-cseNSTTCtup=p2SDxZq7dPayVww@mail.gmail.com/
Fixes: d2bdcde9626c ("perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR")
Reported-by: Ian Rogers <irogers@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/20260311075201.2951073-1-dapeng1.mi@linux.intel.com
|
|
When running the command:
'perf record -e "{instructions,instructions:p}" -j any,counter sleep 1',
a "shift-out-of-bounds" warning is reported on CWF.
UBSAN: shift-out-of-bounds in /kbuild/src/consumer/arch/x86/events/intel/lbr.c:970:15
shift exponent 64 is too large for 64-bit type 'long long unsigned int'
......
intel_pmu_lbr_counters_reorder.isra.0.cold+0x2a/0xa7
intel_pmu_lbr_save_brstack+0xc0/0x4c0
setup_arch_pebs_sample_data+0x114b/0x2400
The warning occurs because the second "instructions:p" event, which
involves branch counters sampling, is incorrectly programmed to fixed
counter 0 instead of the general-purpose (GP) counters 0-3 that support
branch counters sampling. Currently only GP counters 0-3 support branch
counters sampling on CWF, any event involving branch counters sampling
should be programed on GP counters 0-3. Since the counter index of fixed
counter 0 is 32, it leads to the "src" value in below code is right
shifted 64 bits and trigger the "shift-out-of-bounds" warning.
cnt = (src >> (order[j] * LBR_INFO_BR_CNTR_BITS)) & LBR_INFO_BR_CNTR_MASK;
The root cause is the loss of the branch counters constraint for the
new event in the branch counters sampling event group. Since it isn't
yet part of the sibling list. This results in the second
"instructions:p" event being programmed on fixed counter 0 incorrectly
instead of the appropriate GP counters 0-3.
To address this, we apply the missing branch counters constraint for
the last event in the group. Additionally, we introduce a new function,
`intel_set_branch_counter_constr()`, to apply the branch counters
constraint and avoid code duplication.
Fixes: 33744916196b ("perf/x86/intel: Support branch counters logging")
Reported-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260228053320.140406-2-dapeng1.mi@linux.intel.com
Cc: stable@vger.kernel.org
|
|
Both Mi Dapeng and Ian Rogers noted that not everything that sets HES_STOPPED
is required to EF_UPDATE. Specifically the 'step 1' loop of rescheduling
explicitly does EF_UPDATE to ensure the counter value is read.
However, then 'step 2' simply leaves the new counter uninitialized when
HES_STOPPED, even though, as noted above, the thing that stopped them might not
be aware it needs to EF_RELOAD -- since it didn't EF_UPDATE on stop.
One such location that is affected is throttling, throttle does pmu->stop(, 0);
and unthrottle does pmu->start(, 0); possibly restarting an uninitialized counter.
Fixes: a4eaf7f14675 ("perf: Rework the PMU methods")
Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reported-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260311204035.GX606826@noisy.programming.kicks-ass.net
|
|
A production AMD EPYC system crashed with a NULL pointer dereference
in the PMU NMI handler:
BUG: kernel NULL pointer dereference, address: 0000000000000198
RIP: x86_perf_event_update+0xc/0xa0
Call Trace:
<NMI>
amd_pmu_v2_handle_irq+0x1a6/0x390
perf_event_nmi_handler+0x24/0x40
The faulting instruction is `cmpq $0x0, 0x198(%rdi)` with RDI=0,
corresponding to the `if (unlikely(!hwc->event_base))` check in
x86_perf_event_update() where hwc = &event->hw and event is NULL.
drgn inspection of the vmcore on CPU 106 showed a mismatch between
cpuc->active_mask and cpuc->events[]:
active_mask: 0x1e (bits 1, 2, 3, 4)
events[1]: 0xff1100136cbd4f38 (valid)
events[2]: 0x0 (NULL, but active_mask bit 2 set)
events[3]: 0xff1100076fd2cf38 (valid)
events[4]: 0xff1100079e990a90 (valid)
The event that should occupy events[2] was found in event_list[2]
with hw.idx=2 and hw.state=0x0, confirming x86_pmu_start() had run
(which clears hw.state and sets active_mask) but events[2] was
never populated.
Another event (event_list[0]) had hw.state=0x7 (STOPPED|UPTODATE|ARCH),
showing it was stopped when the PMU rescheduled events, confirming the
throttle-then-reschedule sequence occurred.
The root cause is commit 7e772a93eb61 ("perf/x86: Fix NULL event access
and potential PEBS record loss") which moved the cpuc->events[idx]
assignment out of x86_pmu_start() and into step 2 of x86_pmu_enable(),
after the PERF_HES_ARCH check. This broke any path that calls
pmu->start() without going through x86_pmu_enable() -- specifically
the unthrottle path:
perf_adjust_freq_unthr_events()
-> perf_event_unthrottle_group()
-> perf_event_unthrottle()
-> event->pmu->start(event, 0)
-> x86_pmu_start() // sets active_mask but not events[]
The race sequence is:
1. A group of perf events overflows, triggering group throttle via
perf_event_throttle_group(). All events are stopped: active_mask
bits cleared, events[] preserved (x86_pmu_stop no longer clears
events[] after commit 7e772a93eb61).
2. While still throttled (PERF_HES_STOPPED), x86_pmu_enable() runs
due to other scheduling activity. Stopped events that need to
move counters get PERF_HES_ARCH set and events[old_idx] cleared.
In step 2 of x86_pmu_enable(), PERF_HES_ARCH causes these events
to be skipped -- events[new_idx] is never set.
3. The timer tick unthrottles the group via pmu->start(). Since
commit 7e772a93eb61 removed the events[] assignment from
x86_pmu_start(), active_mask[new_idx] is set but events[new_idx]
remains NULL.
4. A PMC overflow NMI fires. The handler iterates active counters,
finds active_mask[2] set, reads events[2] which is NULL, and
crashes dereferencing it.
Move the cpuc->events[hwc->idx] assignment in x86_pmu_enable() to
before the PERF_HES_ARCH check, so that events[] is populated even
for events that are not immediately started. This ensures the
unthrottle path via pmu->start() always finds a valid event pointer.
Fixes: 7e772a93eb61 ("perf/x86: Fix NULL event access and potential PEBS record loss")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260310-perf-v2-1-4a3156fce43c@debian.org
|
|
Move the innermost VMXON+VMXOFF logic out of KVM and into to core x86 so
that TDX can (eventually) force VMXON without having to rely on KVM being
loaded, e.g. to do SEAMCALLs during initialization.
Opportunistically update the comment regarding emergency disabling via NMI
to clarify that virt_rebooting will be set by _another_ emergency callback,
i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only
before _this_ invocation does VMXOFF.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
IBS OP on future hardware can indicate data source from remote socket
as well. Advertise this capability to userspace so that userspace tools
can decode IBS data accordingly.
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260216042530.1546-8-ravi.bangoria@amd.com
|
|
IBS OP on future hardware supports recording samples only for instructions
that does streaming store. Like the existing IBS filters, samples pointing
to instruction which does not cause streaming store are discarded and IBS
restarts internally.
Example:
$ perf record -e ibs_op/strmst=1/ -- <workload>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260216042530.1546-7-ravi.bangoria@amd.com
|
|
IBS on future hardware adds the ability to filter IBS events by examining
RIP bit 63. Because Linux kernel addresses always have bit 63 set while
user-space addresses never do, this capability can be used as a privilege
filter.
So far, IBS supports privilege filtering in software (swfilt=1), where
samples are dropped in the NMI handler. The RIP bit63 hardware filter
enables IBS to be usable by unprivileged users without passing swfilt
flag. So, swfilt flag will silently be ignored when the hardware
filtering capability is present.
Example (non-root user):
$ perf record -e ibs_op//u -- <workload>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260216042530.1546-6-ravi.bangoria@amd.com
|
|
IBS Fetch on future hardware adds fetch latency filtering which
generates interrupt only when FetchLat value exceeds a programmable
threshold.
Hardware allows threshold in 128-cycle increment (i.e. 128, 256, 384
etc.) from 128 to 1920 cycles. Like the existing IBS filters, samples
that fail the latency test are dropped and IBS restarts internally.
Since hardware supports threshold in multiple of 128, add a software
filter on top to support latency threshold with the granularity of 1
cycle in between [128-1920].
Example:
# perf record -e ibs_fetch/fetchlat=128/ -c 10000 -a -- sleep 5
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260216042530.1546-5-ravi.bangoria@amd.com
|
|
The existing IBS_{FETCH|OP}_CTL MSRs combine control and status bits
which leads to RMW race between HW and SW:
HW SW
------------------------ ------------------------------
config = rdmsr(IBS_OP_CTL);
config &= ~EN;
Set IBS_OP_CTL[Val] to 1
trigger NMI
wrmsr(IBS_OP_CTL, config);
// Val is accidentally cleared
Future hardware adds a control-only MSR, IBS_{FETCH|OP}_CTL2, which
provides a second-level "disable" bit (Dis). IBS is now:
Enabled: IBS_{FETCH|OP}_CTL[En] = 1 && IBS_{FETCH|OP}_CTL2[Dis] = 0
Disabled: IBS_{FETCH|OP}_CTL[En] = 0 || IBS_{FETCH|OP}_CTL2[Dis] = 1
The separate "Dis" bit lets software disable IBS without touching any
status fields, eliminating the hardware/software race.
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260216042530.1546-4-ravi.bangoria@amd.com
|
|
Load latency filter threshold is encoded in config1[11:0]. Define a mask
for it instead of hardcoded 0xFFF. Unlike "config" fields whose layout
maps to PERF_{FETCH|OP}_CTL MSR, layout of "config1" is custom defined
so a new set of macros are needed for "config1" fields.
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260216042530.1546-2-ravi.bangoria@amd.com
|
|
Consider the following race:
--------
o OP_CTL contains stale value: OP_CTL[Val]=1, OP_CTL[En]=0
o A new IBS OP event is being added
o [P]: Process context, [N]: NMI context
[P] perf_ibs_add(event) {
[P] if (test_and_set_bit(IBS_ENABLED, pcpu->state))
[P] return;
[P] /* pcpu->state = IBS_ENABLED */
[P]
[P] pcpu->event = event;
[P]
[P] perf_ibs_start(event) {
[P] set_bit(IBS_STARTED, pcpu->state);
[P] /* pcpu->state = IBS_ENABLED | IBS_STARTED */
[P] clear_bit(IBS_STOPPING, pcpu->state);
[P] /* pcpu->state = IBS_ENABLED | IBS_STARTED */
[N] --> NMI due to genuine FETCH event. perf_ibs_handle_irq()
[N] called for OP PMU as well.
[N]
[N] perf_ibs_handle_irq(perf_ibs) {
[N] event = pcpu->event; /* See line 6 */
[N]
[N] if (!test_bit(IBS_STARTED, pcpu->state)) /* false */
[N] return 0;
[N]
[N] if (WARN_ON_ONCE(!event)) /* false */
[N] goto fail;
[N]
[N] if (!(*buf++ & perf_ibs->valid_mask)) /* false due to stale
[N] * IBS_OP_CTL value */
[N] goto fail;
[N]
[N] ...
[N]
[N] perf_ibs_enable_event() // *Accidentally* enable the event.
[N] }
[N]
[N] /*
[N] * Repeated NMIs may follow due to accidentally enabled IBS OP
[N] * event if the sample period is very low. It could also lead
[N] * to pcpu->state corruption if the event gets throttled due
[N] * to too frequent NMIs.
[N] */
[P] perf_ibs_enable_event();
[P] }
[P] }
--------
We cannot safely clear IBS_{FETCH|OP}_CTL while disabling the event,
because the register might be read again later. So, clear the register
in the enable path - before we update pcpu->state and enable the event.
This guarantees that any NMI that lands in the gap finds Val=0 and
bails out cleanly.
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20260216042216.1440-6-ravi.bangoria@amd.com
|
|
Calling perf_allow_kernel() from the NMI context is unsafe and could be
fatal. Capture the permission at event-initialization time by storing it
in event->hw.flags, and have the NMI handler rely on that cached flag
instead of making the call directly.
Fixes: 50a53b60e141d ("perf/amd/ibs: Prevent leaking sensitive data to userspace")
Reported-by: Sadasivan Shaiju <sadasivan.shaiju2@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20260216042216.1440-5-ravi.bangoria@amd.com
|
|
Commit 50a53b60e141 ("perf/amd/ibs: Prevent leaking sensitive data to
userspace") zeroed the physical address and also cleared the PhyAddrVal
flag before copying the value into a perf sample to avoid exposing
physical addresses to unprivileged users.
Clearing PhyAddrVal, however, has an unintended side-effect: several
other IBS fields are considered valid only when this bit is set. As a
result, those otherwise correct fields are discarded, reducing IBS
functionality.
Continue to zero the physical address, but keep the PhyAddrVal bit
intact so the related fields remain usable while still preventing any
address leak.
Fixes: 50a53b60e141 ("perf/amd/ibs: Prevent leaking sensitive data to userspace")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20260216042216.1440-4-ravi.bangoria@amd.com
|
|
The ldlat dependency on l3missonly is specific to Zen 5; newer generations
are not affected. This quirk is documented as an erratum in the following
Revision Guide.
Erratum: 1606 IBS (Instruction Based Sampling) OP Load Latency Filtering
May Capture Unwanted Samples When L3Miss Filtering is Disabled
Revision Guide for AMD Family 1Ah Models 00h-0Fh Processors,
Pub. 58251 Rev. 1.30 July 2025
https://bugzilla.kernel.org/attachment.cgi?id=309193
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20260216042216.1440-3-ravi.bangoria@amd.com
|
|
Add interrupt throttling accounting for below cases:
o IBS Op PMU: A software filter (in addition to the hardware filter)
drops samples whose load latency is below the user-specified
threshold.
o IBS Fetch PMU: Samples discarded due to the zero-RIP erratum (#1197).
Although these samples are discarded, the NMI cost is still incurred, so
they should be counted for interrupt throttling.
Fixes: 26db2e0c51fe83e1dd852c1321407835b481806e ("perf/x86/amd/ibs: Work around erratum #1197")
Fixes: d20610c19b4a22bc69085b7eb7a02741d51de30e ("perf/amd/ibs: Add support for OP Load Latency Filtering")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20260216042216.1440-2-ravi.bangoria@amd.com
|
|
IMC on SPR and EMR does not support sub-channels. In contrast, CPUs
that use gnr_uncores[] (e.g. Granite Rapids and Sierra Forest)
implement two command schedulers (SCH0/SCH1) per memory channel,
providing logically independent command and data paths.
Do not reuse the spr_uncore_imc[] configuration for these CPUs.
Instead, introduce a dedicated gnr_uncore_imc[] with per-scheduler
events, so userspace can monitor SCH0 and SCH1 independently.
On these CPUs, replace cas_count_{read,write} with
cas_count_{read,write}_sch{0,1}. This may break existing userspace
that relies on cas_count_{read,write}, prompting it to switch to the
per-scheduler events, as the legacy event reports only partial
traffic (SCH0).
Fixes: 632c4bf6d007 ("perf/x86/intel/uncore: Support Granite Rapids")
Fixes: cb4a6ccf3583 ("perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge")
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260210005225.20311-1-zide.chen@intel.com
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance event updates from Ingo Molnar:
"x86 PMU driver updates:
- Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs
(Dapeng Mi)
Compared to previous iterations of the Intel PMU code, there's been
a lot of changes, which center around three main areas:
- Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the
Off-Core Response (OCR) facility
- New PEBS data source encoding layout
- Support the new "RDPMC user disable" feature
- Likewise, a large series adds uncore PMU support for Intel Diamond
Rapids (DMR) CPUs (Zide Chen)
This centers around these four main areas:
- DMR may have two Integrated I/O and Memory Hub (IMH) dies,
separate from the compute tile (CBB) dies. Each CBB and each IMH
die has its own discovery domain.
- Unlike prior CPUs that retrieve the global discovery table
portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON
discovery and MSR for CBB PMON discovery.
- DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR,
PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6.
- IIO free-running counters in DMR are MMIO-based, unlike SPR.
- Also add support for Add missing PMON units for Intel Panther Lake,
and support Nova Lake (NVL), which largely maps to Panther Lake.
(Zide Chen)
- KVM integration: Add support for mediated vPMUs (by Kan Liang and
Sean Christopherson, with fixes and cleanups by Peter Zijlstra,
Sandipan Das and Mingwei Zhang)
- Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs,
which are a low-power variant of Panther Lake (Zide Chen)
- Add core, cstate and MSR PMU support for the Airmont NP Intel CPU
(aka MaxLinear Lightning Mountain), which maps to the existing
Airmont code (Martin Schiller)
Performance enhancements:
- Speed up kexec shutdown by avoiding unnecessary cross CPU calls
(Jan H. Schönherr)
- Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim)
User-space stack unwinding support:
- Various cleanups and refactorings in preparation to generalize the
unwinding code for other architectures (Jens Remus)
Uprobes updates:
- Transition from kmap_atomic to kmap_local_page (Keke Ming)
- Fix incorrect lockdep condition in filter_chain() (Breno Leitao)
- Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov)
Misc fixes and cleanups:
- s390: Remove kvm_types.h from Kbuild (Randy Dunlap)
- x86/intel/uncore: Convert comma to semicolon (Chen Ni)
- x86/uncore: Clean up const mismatch (Greg Kroah-Hartman)
- x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)"
* tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
s390: remove kvm_types.h from Kbuild
uprobes: Fix incorrect lockdep condition in filter_chain()
x86/ibs: Fix typo in dc_l2tlb_miss comment
x86/uprobes: Fix XOL allocation failure for 32-bit tasks
perf/x86/intel/uncore: Convert comma to semicolon
perf/x86/intel: Add support for rdpmc user disable feature
perf/x86: Use macros to replace magic numbers in attr_rdpmc
perf/x86/intel: Add core PMU support for Novalake
perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL
perf/x86/intel: Add core PMU support for DMR
perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR
perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL
perf/core: Fix slow perf_event_task_exit() with LBR callstacks
perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls
uprobes: use kmap_local_page() for temporary page mappings
arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol()
perf/x86/intel/uncore: Add Nova Lake support
...
|
|
By default when users program perf to sample branch instructions
(PERF_COUNT_HW_BRANCH_INSTRUCTIONS) with a sample period of 1, perf
interprets this as a special case and enables BTS (Branch Trace Store)
as an optimization to avoid taking an interrupt on every branch.
Since BTS doesn't virtualize, this optimization doesn't make sense when
the request originates from a guest. Add an additional check that
prevents this optimization for virtualized events (exclude_host).
Reported-by: Jan H. Schönherr <jschoenh@amazon.de>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fernand Sieber <sieberf@amazon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20251211183604.868641-1-sieberf@amazon.com
|
|
Replace comma between expressions with semicolons.
Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it is seems best to use ';'
unless ',' is intended.
Found by inspection.
No functional change intended.
Compile tested only.
Fixes: e7d5f2ea0923 ("perf/x86/intel/uncore: Add Nova Lake support")
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260114023652.3926117-1-nichen@iscas.ac.cn
|
|
Starting with Panther Cove, the rdpmc user disable feature is supported.
This feature allows the perf system to disable user space rdpmc reads at
the counter level.
Currently, when a global counter is active, any user with rdpmc rights
can read it, even if perf access permissions forbid it (e.g., disallow
reading ring 0 counters). The rdpmc user disable feature mitigates this
security concern.
Details:
- A new RDPMC_USR_DISABLE bit (bit 37) in each EVNTSELx MSR indicates
that the GP counter cannot be read by RDPMC in ring 3.
- New RDPMC_USR_DISABLE bits in IA32_FIXED_CTR_CTRL MSR (bits 33, 37,
41, 45, etc.) for fixed counters 0, 1, 2, 3, etc.
- When calling rdpmc instruction for counter x, the following pseudo
code demonstrates how the counter value is obtained:
If (!CPL0 && RDPMC_USR_DISABLE[x] == 1) ? 0 : counter_value;
- RDPMC_USR_DISABLE is enumerated by CPUID.0x23.0.EBX[2].
This patch extends the current global user space rdpmc control logic via
the sysfs interface (/sys/devices/cpu/rdpmc) as follows:
- rdpmc = 0:
Global user space rdpmc and counter-level user space rdpmc for all
counters are both disabled.
- rdpmc = 1:
Global user space rdpmc is enabled during the mmap-enabled time window,
and counter-level user space rdpmc is enabled only for non-system-wide
events. This prevents counter data leaks as count data is cleared
during context switches.
- rdpmc = 2:
Global user space rdpmc and counter-level user space rdpmc for all
counters are enabled unconditionally.
The new rdpmc settings only affect newly activated perf events; currently
active perf events remain unaffected. This simplifies and cleans up the
code. The default value of rdpmc remains unchanged at 1.
For more details about rdpmc user disable, please refer to chapter 15
"RDPMC USER DISABLE" in ISE documentation.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114011750.350569-8-dapeng1.mi@linux.intel.com
|
|
Replace magic numbers in attr_rdpmc with macros to improve readability
and make their meanings clearer for users.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114011750.350569-7-dapeng1.mi@linux.intel.com
|
|
This patch enables core PMU support for Novalake, covering both P-core
and E-core. It includes Arctic Wolf-specific counters and PEBS
constraints, and the model-specific OMR extra registers table.
Since Coyote Cove shares the same PMU capabilities as Panther Cove, the
existing Panther Cove PMU enabling functions are reused for Coyote Cove.
For detailed information about counter constraints, please refer to
section 16.3 "COUNTER RESTRICTIONS" in the ISE documentation.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114011750.350569-6-dapeng1.mi@linux.intel.com
|
|
Similar to DMR (Panther Cove uarch), both P-core (Coyote Cove uarch) and
E-core (Arctic Wolf uarch) of NVL adopt the new PEBS memory auxiliary
info layout.
Coyote Cove microarchitecture shares the same PMU capabilities, including
the memory auxiliary info layout, with Panther Cove. Arctic Wolf
microarchitecture has a similar layout to Panther Cove, with the only
difference being specific data source encoding for L2 hit cases (up to
the L2 cache level). The OMR encoding remains the same as in Panther Cove.
For detailed information on the memory auxiliary info encoding, please
refer to section 16.2 "PEBS LOAD LATENCY AND STORE LATENCY FACILITY" in
the latest ISE documentation.
This patch defines Arctic Wolf specific data source encoding and then
supports PEBS memory auxiliary info field for NVL.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114011750.350569-5-dapeng1.mi@linux.intel.com
|
|
This patch enables core PMU features for Diamond Rapids (Panther Cove
microarchitecture), including Panther Cove specific counter and PEBS
constraints, a new cache events ID table, and the model-specific OMR
events extra registers table.
For detailed information about counter constraints, please refer to
section 16.3 "COUNTER RESTRICTIONS" in the ISE documentation.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114011750.350569-4-dapeng1.mi@linux.intel.com
|
|
With the introduction of the OMR feature, the PEBS memory auxiliary info
field for load and store latency events has been restructured for DMR.
The memory auxiliary info field's bit[8] indicates whether a L2 cache
miss occurred for a memory load or store instruction. If bit[8] is 0,
it signifies no L2 cache miss, and bits[7:0] specify the exact cache data
source (up to the L2 cache level). If bit[8] is 1, bits[7:0] represent
the OMR encoding, indicating the specific L3 cache or memory region
involved in the memory access. A significant enhancement is OMR encoding
provides up to 8 fine-grained memory regions besides the cache region.
A significant enhancement for OMR encoding is the ability to provide
up to 8 fine-grained memory regions in addition to the cache region,
offering more detailed insights into memory access regions.
For detailed information on the memory auxiliary info encoding, please
refer to section 16.2 "PEBS LOAD LATENCY AND STORE LATENCY FACILITY" in
the ISE documentation.
This patch ensures that the PEBS memory auxiliary info field is correctly
interpreted and utilized in DMR.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114011750.350569-3-dapeng1.mi@linux.intel.com
|
|
Diamond Rapids (DMR) and Nova Lake (NVL) introduce an enhanced
Off-Module Response (OMR) facility, replacing the Off-Core Response (OCR)
Performance Monitoring of previous processors.
Legacy microarchitectures used the OCR facility to evaluate off-core and
multi-core off-module transactions. The newly named OMR facility improves
OCR capabilities for scalable coverage of new memory systems in
multi-core module systems.
Similar to OCR, 4 additional off-module configuration MSRs
(OFFMODULE_RSP_0 to OFFMODULE_RSP_3) are introduced to specify attributes
of off-module transactions. When multiple identical OMR events are
created, they need to occupy the same OFFMODULE_RSP_x MSR. To ensure
these multiple identical OMR events can work simultaneously, the
intel_alt_er() and intel_fixup_er() helpers are enhanced to rotate these
OMR events across different OFFMODULE_RSP_* MSRs, similar to previous OCR
events.
For more details about OMR, please refer to section 16.1 "OFF-MODULE
RESPONSE (OMR) FACILITY" in ISE documentation.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260114011750.350569-2-dapeng1.mi@linux.intel.com
|
|
In a vain attempt to consolidate the email zoo switch everything to the
kernel.org account.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|