summaryrefslogtreecommitdiff
path: root/arch/x86/events
AgeCommit message (Collapse)Author
2026-05-05perf/x86/intel: Enable auto counter reload for DMRDapeng Mi
Panther cove µarch starts to support auto counter reload (ACR), but the static_call intel_pmu_enable_acr_event() is not updated for the Panther Cove µarch used by DMR. It leads to the auto counter reload is not really enabled on DMR. Update static_call intel_pmu_enable_acr_event() in intel_pmu_init_pnc(). Fixes: d345b6bb8860 ("perf/x86/intel: Add core PMU support for DMR") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260430002558.712334-5-dapeng1.mi@linux.intel.com
2026-05-05perf/x86/intel: Disable PMI for self-reloaded ACR eventsDapeng Mi
On platforms with Auto Counter Reload (ACR) support, such as NVL, a "NMI received for unknown reason 30" warning is observed when running multiple events in a group with ACR enabled: $ perf record -e '{instructions/period=20000,acr_mask=0x2/u,\ cycles/period=40000,acr_mask=0x3/u}' ./test The warning occurs because the Performance Monitoring Interrupt (PMI) is enabled for the self-reloaded event (the cycles event in this case). According to the Intel SDM, the overflow bit (IA32_PERF_GLOBAL_STATUS.PMCn_OVF) is never set for self-reloaded events. Since the bit is not set, the perf NMI handler cannot identify the source of the interrupt, leading to the "unknown reason" message. Furthermore, enabling PMI for self-reloaded events is unnecessary and can lead to extraneous records that pollute the user's requested data. Disable the interrupt bit for all events configured with ACR self-reload. Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload") Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260430002558.712334-4-dapeng1.mi@linux.intel.com
2026-05-05perf/x86/intel: Always reprogram ACR events to prevent stale masksDapeng Mi
Members of an ACR group are logically linked via a bitmask of their hardware counter indices. If some members of the group are assigned new hardware counters during rescheduling, even events that keep their original counter index must be updated with a new mask. Without this, an event will continue to use a stale acr_mask that references the old indices of its group peers. Ensure all ACR events are reprogrammed during the scheduling path to maintain consistency across the group. Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260430002558.712334-3-dapeng1.mi@linux.intel.com
2026-05-05perf/x86/intel: Improve validation and configuration of ACR masksDapeng Mi
Currently there are several issues on the user space ACR mask validation and configuration. - The validation for user space ACR mask (attr.config2) is incomplete, e.g., the ACR mask could include the index which belongs to another ACR events group, but it's not validated. - An early return on an invalid ACR mask caused all subsequent ACR groups to be skipped. - The stale hardware ACR mask (hw.config1) is not cleared before setting new hardware ACR mask. The following changes address all of the above issues. - Figure out the event index group of an ACR group. Any bits in the user-space mask not present in the index group are now dropped. - Instead of an early return on invalid bits, drop only the invalid portions and continue iterating through all ACR events to ensure full configuration. - Explicitly clear the stale hardware ACR mask for each event prior to writing the new configuration. Besides, a non-leader event member of ACR group could be disabled in theory. This could cause bit-shifting errors in the acr_mask of remaining group members. But since ACR sampling requires all events to be active, this should not be a big concern in real use case. Add a "FIXME" comment to notice this risk. Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260430002558.712334-2-dapeng1.mi@linux.intel.com
2026-04-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "Arm: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This uses the new infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware, and which came through the tracing tree - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM - Finally add support for pKVM protected guests, where pages are unmapped from the host as they are faulted into the guest and can be shared back from the guest using pKVM hypercalls. Protected guests are created using a new machine type identifier. As the elusive guestmem has not yet delivered on its promises, anonymous memory is also supported This is only a first step towards full isolation from the host; for example, the CPU register state and DMA accesses are not yet isolated. Because this does not really yet bring fully what it promises, it is hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is created. Caveat emptor - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups - A small set of patches fixing the Stage-2 MMU freeing in error cases - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls - The usual cleanups and other selftest churn LoongArch: - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel() - Add DMSINTC irqchip in kernel support RISC-V: - Fix steal time shared memory alignment checks - Fix vector context allocation leak - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() - Fix double-free of sdata in kvm_pmu_clear_snapshot_area() - Fix integer overflow in kvm_pmu_validate_counter_mask() - Fix shift-out-of-bounds in make_xfence_request() - Fix lost write protection on huge pages during dirty logging - Split huge pages during fault handling for dirty logging - Skip CSR restore if VCPU is reloaded on the same core - Implement kvm_arch_has_default_irqchip() for KVM selftests - Factored-out ISA checks into separate sources - Added hideleg to struct kvm_vcpu_config - Factored-out VCPU config into separate sources - Support configuration of per-VM HGATP mode from KVM user space s390: - Support for ESA (31-bit) guests inside nested hypervisors - Remove restriction on memslot alignment, which is not needed anymore with the new gmap code - Fix LPSW/E to update the bear (which of course is the breaking event address register) x86: - Shut up various UBSAN warnings on reading module parameter before they were initialized - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes - As an optimization, bail early when trying to unsync 4KiB mappings if the target gfn can just be mapped with a 2MiB hugepage x86 generic: - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM would dereference stack pointer after an exit to userspace - Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier") - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX and SVM enabling) as it is needed for trusted I/O - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM - Exempt SMM from CPUID faulting on Intel, as per the spec - Misc hardening and cleanup changes x86 (AMD): - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would overwrite state when enabling virtualization on multiple CPUs in parallel. This should not be a problem because OSVW should usually be the same for all CPUs - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will crash the host due to an RMP violation page fault - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock, and enforce it by lockdep. Fix various bugs where sev_guest() was not ensured to be stable for the whole duration of a function or ioctl - Convert a pile of kvm->lock SEV code to guard() - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6 as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2 rather than CR2, for example). Only set CR2 and DR6 when consumption of the payload is imminent, but on the other hand force delivery of the payload in all paths where userspace retrieves CR2 or DR6 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead of vmcb02->save.cr2. The value is out of sync after a save/restore or after a #PF is injected into L2 - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore - Add a variety of missing nSVM consistency checks - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions - Add support for save+restore of virtualized LBRs (on SVM) - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain - Aggressively sanitize fields when copying from vmcb12, to guard against unintentionally allowing L1 to utilize yet-to-be-defined features - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. There are remaining issues in that KVM doesn't handle size prefix overrides for 64-bit guests - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's architectural but sketchy behavior of generating #GP for "unsupported" addresses) - Cache all used vmcb12 fields to further harden against TOCTOU bugs x86 (Intel): - Drop obsolete branch hint prefixes from the VMX instruction macros - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate - Code cleanups guest_memfd: - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support reclaim, the memory is unevictable, and there is no storage to write back to LoongArch selftests: - Add KVM PMU test cases s390 selftests: - Enable more memory selftests x86 selftests: - Add support for Hygon CPUs in KVM selftests - Fix a bug in the MSR test where it would get false failures on AMD/Hygon CPUs with exactly one of RDPID or RDTSCP - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a bug where the kernel would attempt to collapse guest_memfd folios against KVM's will" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits) KVM: x86: use inlines instead of macros for is_sev_*guest x86/virt: Treat SVM as unsupported when running as an SEV+ guest KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper KVM: SEV: use mutex guard in snp_handle_guest_req() KVM: SEV: use mutex guard in sev_mem_enc_unregister_region() KVM: SEV: use mutex guard in sev_mem_enc_ioctl() KVM: SEV: use mutex guard in snp_launch_update() KVM: SEV: Assert that kvm->lock is held when querying SEV+ support KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe" KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y KVM: SEV: WARN on unhandled VM type when initializing VM KVM: LoongArch: selftests: Add PMU overflow interrupt test KVM: LoongArch: selftests: Add basic PMU event counting test KVM: LoongArch: selftests: Add cpucfg read/write helpers LoongArch: KVM: Add DMSINTC inject msi to vCPU LoongArch: KVM: Add DMSINTC device support LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch ...
2026-04-14Merge tag 'x86-cleanups-2026-04-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: - Consolidate AMD and Hygon cases in parse_topology() (Wei Wang) - asm constraints cleanups in __iowrite32_copy() (Uros Bizjak) - Drop AMD Extended Interrupt LVT macros (Naveen N Rao) - Don't use REALLY_SLOW_IO for delays (Juergen Gross) - paravirt cleanups (Juergen Gross) - FPU code cleanups (Borislav Petkov) - split-lock handling code cleanups (Borislav Petkov, Ronan Pigott) * tag 'x86-cleanups-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu: Correct the comment explaining what xfeatures_in_use() does x86/split_lock: Don't warn about unknown split_lock_detect parameter x86/fpu: Correct misspelled xfeaures_to_write local var x86/apic: Drop AMD Extended Interrupt LVT macros x86/cpu/topology: Consolidate AMD and Hygon cases in parse_topology() block/floppy: Don't use REALLY_SLOW_IO for delays x86/paravirt: Replace io_delay() hook with a bool x86/irqflags: Preemptively move include paravirt.h directive where it belongs x86/split_lock: Restructure the unwieldy switch-case in sld_state_show() x86/local: Remove trailing semicolon from _ASM_XADD in local_add_return() x86/asm: Use inout "+" asm onstraint modifiers in __iowrite32_copy()
2026-04-14Merge tag 'perf-core-2026-04-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Core updates: - Try to allocate task_ctx_data quickly, to optimize O(N^2) algorithm on large systems with O(100k) threads (Namhyung Kim) AMD PMU driver IBS support updates and fixes, by Ravi Bangoria: - Fix interrupt accounting for discarded samples - Fix a Zen5-specific quirk - Fix PhyAddrVal handling - Fix NMI-safety with perf_allow_kernel() - Fix a race between event add and NMIs Intel PMU driver updates: - Only check GP counters for PEBS constraints validation (Dapeng Mi) MSR driver: - Turn SMI_COUNT and PPERF on by default, instead of a long list of CPU models to enable them on (Kan Liang) ... and misc cleanups and fixes by Aldf Conte, Anshuman Khandual, Namhyung Kim, Ravi Bangoria and Yen-Hsiang Hsu" * tag 'perf-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/events: Replace READ_ONCE() with standard pgtable accessors perf/x86/msr: Make SMI and PPERF on by default perf/x86/intel/p4: Fix unused variable warning in p4_pmu_init() perf/x86/intel: Only check GP counters for PEBS constraints validation perf/x86/amd/ibs: Fix comment typo in ibs_op_data perf/amd/ibs: Advertise remote socket capability perf/amd/ibs: Enable streaming store filter perf/amd/ibs: Enable RIP bit63 hardware filtering perf/amd/ibs: Enable fetch latency filtering perf/amd/ibs: Support IBS_{FETCH|OP}_CTL2[Dis] to eliminate RMW race perf/amd/ibs: Add new MSRs and CPUID bits definitions perf/amd/ibs: Define macro for ldlat mask and shift perf/amd/ibs: Avoid race between event add and NMI perf/amd/ibs: Avoid calling perf_allow_kernel() from the IBS NMI handler perf/amd/ibs: Preserve PhyAddrVal bit when clearing PhyAddr MSR perf/amd/ibs: Limit ldlat->l3missonly dependency to Zen5 perf/amd/ibs: Account interrupt for discarded samples perf/core: Simplify __detach_global_ctx_data() perf/core: Try to allocate task_ctx_data quickly perf/core: Pass GFP flags to attach_task_ctx_data()
2026-04-13Merge tag 'kvm-x86-vmxon-7.1' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 VMXON and EFER.SVME extraction for 7.1 Move _only_ VMXON+VMXOFF and EFER.SVME toggling out of KVM (versus all of VMX and SVM enabling) out of KVM and into the core kernel so that non-KVM TDX enabling, e.g. for trusted I/O, can make SEAMCALLs without needing to ensure KVM is fully loaded. TIO isn't a hypervisor, and isn't trying to be a hypervisor. Specifically, TIO should _never_ have it's own VMCSes (that are visible to the host; the TDX-Module has it's own VMCSes to do SEAMCALL/SEAMRET), and so there is simply no reason to move that functionality out of KVM. With that out of the way, dealing with VMXON/VMXOFF and EFER.SVME is a fairly simple refcounting game.
2026-04-07perf/x86/intel/uncore: Remove extra double quote markZide Chen
The third argument in INTEL_UNCORE_FR_EVENT_DESC() is subject to __stringify(), and the extra double quote marks can result in the expansion "3.814697266e-6" in the sysfs knobs, instead of 3.814697266e-6. This is incorrect, though it may still work for perf, e.g. perf stat -e uncore_iio_free_running_0/bw_in_port0/ Fixes: d8987048f665 ("perf/x86/intel/uncore: Support IIO free-running counters on DMR") Closes: https://lore.kernel.org/all/20251231224233.113839-1-zide.chen@intel.com/ Reported-by: Chun-Tse Shao <ctshao@google.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chun-Tse Shao <ctshao@google.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260313174050.171704-5-zide.chen@intel.com
2026-04-07perf/x86/intel/uncore: Fix die ID init and look up bugsZide Chen
In snbep_pci2phy_map_init(), in the nr_node_ids > 8 path, uncore_device_to_die() may return -1 when all CPUs associated with the UBOX device are offline. Remove the WARN_ON_ONCE(die_id == -1) check for two reasons: - The current code breaks out of the loop. This is incorrect because pci_get_device() does not guarantee iteration in domain or bus order, so additional UBOX devices may be skipped during the scan. - Returning -EINVAL is incorrect, since marking offline buses with die_id == -1 is expected and should not be treated as an error. Separately, when NUMA is disabled on a NUMA-capable platform, pcibus_to_node() returns NUMA_NO_NODE, causing uncore_device_to_die() to return -1 for all PCI devices. As a result, spr_update_device_location(), used on Intel SPR and EMR, ignores the corresponding PMON units and does not add them to the RB tree. Fix this by using uncore_pcibus_to_dieid(), which retrieves topology from the UBOX GIDNIDMAP register and works regardless of whether NUMA is enabled in Linux. This requires snbep_pci2phy_map_init() to be added in spr_uncore_pci_init(). Keep uncore_device_to_die() only for the nr_node_ids > 8 case, where NUMA is expected to be enabled. Fixes: 9a7832ce3d92 ("perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info") Fixes: 65248a9a9ee1 ("perf/x86/uncore: Add a quirk for UPI on SPR") Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Steve Wahl <steve.wahl@hpe.com> Link: https://patch.msgid.link/20260313174050.171704-4-zide.chen@intel.com
2026-04-07perf/x86/intel/uncore: Skip discovery table for offline diesZide Chen
This warning can be triggered if NUMA is disabled and the system boots with fewer CPUs than the number of CPUs in die 0. WARNING: CPU: 9 PID: 7257 at uncore.c:1157 uncore_pci_pmu_register+0x136/0x160 [intel_uncore] Currently, the discovery table continues to be parsed even if all CPUs in the associated die are offline. This can lead to an array overflow at "pmu->boxes[die] = box" in uncore_pci_pmu_register(), which may trigger the warning above or cause other issues. Fixes: edae1f06c2cd ("perf/x86/intel/uncore: Parse uncore discovery tables") Reported-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Steve Wahl <steve.wahl@hpe.com> Link: https://patch.msgid.link/20260313174050.171704-3-zide.chen@intel.com
2026-04-07perf/x86/intel/uncore: Fix iounmap() leak on global_init failureZide Chen
Kernel test robot reported: Unverified Error/Warning (likely false positive, kindly check if interested): arch/x86/events/intel/uncore_discovery.c:293:2-8: ERROR: missing iounmap; ioremap on line 288 and execution via conditional on line 292 If domain->global_init() fails in __parse_discovery_table(), the ioremap'ed MMIO region is not released before returning, resulting in an MMIO mapping leak. Fixes: b575fc0e3357 ("perf/x86/intel/uncore: Add domain global init callback") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260313174050.171704-2-zide.chen@intel.com
2026-04-04x86/apic: Drop AMD Extended Interrupt LVT macrosNaveen N Rao (AMD)
AMD defines Extended Interrupt Local Vector Table (EILVT) registers to allow for additional interrupt sources. While the APIC registers for those are unique to AMD, the format of those registers follows the standard LVT registers. Drop EILVT-specific macros in favor of the standard APIC LVT macros. Drop unused APIC_EILVT_NR_AMD_K8 and APIC_EILVT_LVTOFF while at it. No functional change. [ bp: Merge the two cleanup patches into one. ] Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Manali Shukla <manali.shukla@amd.com> Link: https://patch.msgid.link/b98d69037c0102d2ccd082a941888a689cd214c9.1775019269.git.naveen@kernel.org
2026-04-03perf/x86/msr: Make SMI and PPERF on by defaultKan Liang
The MSRs, SMI_COUNT and PPERF, are model-specific MSRs. A very long CPU ID list is maintained to indicate the supported platforms. With more and more platforms being introduced, new CPU IDs have to be kept adding. Also, the old kernel has to be updated to apply the new CPU ID. The MSRs have been introduced for a long time. There is no plan to change them in the near future. Furthermore, the current code utilizes rdmsr_safe() to check the availability of MSRs before using it. Make them on by default. It should be good enough to only rely on the rdmsr_safe() to check their availability for both existing and future platforms. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260327052844.818218-1-dapeng1.mi@linux.intel.com
2026-04-02perf/x86: Fix potential bad container_of in intel_pmu_hw_configIan Rogers
Auto counter reload may have a group of events with software events present within it. The software event PMU isn't the x86_hybrid_pmu and a container_of operation in intel_pmu_set_acr_caused_constr (via the hybrid helper) could cause out of bound memory reads. Avoid this by guarding the call to intel_pmu_set_acr_caused_constr with an is_x86_event check. Fixes: ec980e4facef ("perf/x86/intel: Support auto counter reload") Signed-off-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Falcon <thomas.falcon@intel.com> Link: https://patch.msgid.link/20260312194305.1834035-1-irogers@google.com
2026-03-24perf/x86/intel/p4: Fix unused variable warning in p4_pmu_init()Aldo Conte
Build the kernel with make W=1 generates the following warning: arch/x86/events/intel/p4.c: In function ‘p4_pmu_init’: arch/x86/events/intel/p4.c:1370:27: error: variable ‘high’ set but not used [-Werror=unused-but-set-variable] 1370 | unsigned int low, high; | ^~~~ This happens because, although both variables are declared and initialized by rdmsr, only `low` is used in the subsequent if statement. This patch uses the rdmsrq() macro instead of the rdmsr() macro. The rdmsrq() macro avoids the use of high and low variables because it reads the msr value in a single u64 variable. Also, replace (1 << 7) with the proper macro. Running `make W=1` again resolves the error. I was unable to test the patch because i do not have the hardware. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Aldo Conte <aldocontelk@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260320112302.281549-1-aldocontelk@gmail.com
2026-03-12perf/x86/intel: Only check GP counters for PEBS constraints validationDapeng Mi
It's good enough to only check GP counters for PEBS constraints validation since constraints overlap can only happen on GP counters. Besides opportunistically refine the code style and use pr_warn() to replace pr_info() as the message itself is a warning message. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260228053320.140406-1-dapeng1.mi@linux.intel.com
2026-03-12perf/x86/intel: Fix OMR snoop information parsing issuesDapeng Mi
When omr_source is 0x2, the omr_snoop (bit[6]) and omr_promoted (bit[7]) fields are combined to represent the snoop information. However, the omr_promoted field was not left-shifted by 1 bit, resulting in incorrect snoop information. Besides, the snoop information parsing is not accurate for some OMR sources, like the snoop information should be SNOOP_NONE for these memory access (omr_source >= 7) instead of SNOOP_HIT. Fix these issues. Closes: https://lore.kernel.org/all/CAP-5=fW4zLWFw1v38zCzB9-cseNSTTCtup=p2SDxZq7dPayVww@mail.gmail.com/ Fixes: d2bdcde9626c ("perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR") Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/20260311075201.2951073-1-dapeng1.mi@linux.intel.com
2026-03-12perf/x86/intel: Add missing branch counters constraint applyDapeng Mi
When running the command: 'perf record -e "{instructions,instructions:p}" -j any,counter sleep 1', a "shift-out-of-bounds" warning is reported on CWF. UBSAN: shift-out-of-bounds in /kbuild/src/consumer/arch/x86/events/intel/lbr.c:970:15 shift exponent 64 is too large for 64-bit type 'long long unsigned int' ...... intel_pmu_lbr_counters_reorder.isra.0.cold+0x2a/0xa7 intel_pmu_lbr_save_brstack+0xc0/0x4c0 setup_arch_pebs_sample_data+0x114b/0x2400 The warning occurs because the second "instructions:p" event, which involves branch counters sampling, is incorrectly programmed to fixed counter 0 instead of the general-purpose (GP) counters 0-3 that support branch counters sampling. Currently only GP counters 0-3 support branch counters sampling on CWF, any event involving branch counters sampling should be programed on GP counters 0-3. Since the counter index of fixed counter 0 is 32, it leads to the "src" value in below code is right shifted 64 bits and trigger the "shift-out-of-bounds" warning. cnt = (src >> (order[j] * LBR_INFO_BR_CNTR_BITS)) & LBR_INFO_BR_CNTR_MASK; The root cause is the loss of the branch counters constraint for the new event in the branch counters sampling event group. Since it isn't yet part of the sibling list. This results in the second "instructions:p" event being programmed on fixed counter 0 incorrectly instead of the appropriate GP counters 0-3. To address this, we apply the missing branch counters constraint for the last event in the group. Additionally, we introduce a new function, `intel_set_branch_counter_constr()`, to apply the branch counters constraint and avoid code duplication. Fixes: 33744916196b ("perf/x86/intel: Support branch counters logging") Reported-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260228053320.140406-2-dapeng1.mi@linux.intel.com Cc: stable@vger.kernel.org
2026-03-12x86/perf: Make sure to program the counter value for stopped events on migrationPeter Zijlstra
Both Mi Dapeng and Ian Rogers noted that not everything that sets HES_STOPPED is required to EF_UPDATE. Specifically the 'step 1' loop of rescheduling explicitly does EF_UPDATE to ensure the counter value is read. However, then 'step 2' simply leaves the new counter uninitialized when HES_STOPPED, even though, as noted above, the thing that stopped them might not be aware it needs to EF_RELOAD -- since it didn't EF_UPDATE on stop. One such location that is affected is throttling, throttle does pmu->stop(, 0); and unthrottle does pmu->start(, 0); possibly restarting an uninitialized counter. Fixes: a4eaf7f14675 ("perf: Rework the PMU methods") Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reported-by: Ian Rogers <irogers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260311204035.GX606826@noisy.programming.kicks-ass.net
2026-03-12perf/x86: Move event pointer setup earlier in x86_pmu_enable()Breno Leitao
A production AMD EPYC system crashed with a NULL pointer dereference in the PMU NMI handler: BUG: kernel NULL pointer dereference, address: 0000000000000198 RIP: x86_perf_event_update+0xc/0xa0 Call Trace: <NMI> amd_pmu_v2_handle_irq+0x1a6/0x390 perf_event_nmi_handler+0x24/0x40 The faulting instruction is `cmpq $0x0, 0x198(%rdi)` with RDI=0, corresponding to the `if (unlikely(!hwc->event_base))` check in x86_perf_event_update() where hwc = &event->hw and event is NULL. drgn inspection of the vmcore on CPU 106 showed a mismatch between cpuc->active_mask and cpuc->events[]: active_mask: 0x1e (bits 1, 2, 3, 4) events[1]: 0xff1100136cbd4f38 (valid) events[2]: 0x0 (NULL, but active_mask bit 2 set) events[3]: 0xff1100076fd2cf38 (valid) events[4]: 0xff1100079e990a90 (valid) The event that should occupy events[2] was found in event_list[2] with hw.idx=2 and hw.state=0x0, confirming x86_pmu_start() had run (which clears hw.state and sets active_mask) but events[2] was never populated. Another event (event_list[0]) had hw.state=0x7 (STOPPED|UPTODATE|ARCH), showing it was stopped when the PMU rescheduled events, confirming the throttle-then-reschedule sequence occurred. The root cause is commit 7e772a93eb61 ("perf/x86: Fix NULL event access and potential PEBS record loss") which moved the cpuc->events[idx] assignment out of x86_pmu_start() and into step 2 of x86_pmu_enable(), after the PERF_HES_ARCH check. This broke any path that calls pmu->start() without going through x86_pmu_enable() -- specifically the unthrottle path: perf_adjust_freq_unthr_events() -> perf_event_unthrottle_group() -> perf_event_unthrottle() -> event->pmu->start(event, 0) -> x86_pmu_start() // sets active_mask but not events[] The race sequence is: 1. A group of perf events overflows, triggering group throttle via perf_event_throttle_group(). All events are stopped: active_mask bits cleared, events[] preserved (x86_pmu_stop no longer clears events[] after commit 7e772a93eb61). 2. While still throttled (PERF_HES_STOPPED), x86_pmu_enable() runs due to other scheduling activity. Stopped events that need to move counters get PERF_HES_ARCH set and events[old_idx] cleared. In step 2 of x86_pmu_enable(), PERF_HES_ARCH causes these events to be skipped -- events[new_idx] is never set. 3. The timer tick unthrottles the group via pmu->start(). Since commit 7e772a93eb61 removed the events[] assignment from x86_pmu_start(), active_mask[new_idx] is set but events[new_idx] remains NULL. 4. A PMC overflow NMI fires. The handler iterates active counters, finds active_mask[2] set, reads events[2] which is NULL, and crashes dereferencing it. Move the cpuc->events[hwc->idx] assignment in x86_pmu_enable() to before the PERF_HES_ARCH check, so that events[] is populated even for events that are not immediately started. This ensures the unthrottle path via pmu->start() always finds a valid event pointer. Fixes: 7e772a93eb61 ("perf/x86: Fix NULL event access and potential PEBS record loss") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260310-perf-v2-1-4a3156fce43c@debian.org
2026-03-04KVM: VMX: Move core VMXON enablement to kernelSean Christopherson
Move the innermost VMXON+VMXOFF logic out of KVM and into to core x86 so that TDX can (eventually) force VMXON without having to rely on KVM being loaded, e.g. to do SEAMCALLs during initialization. Opportunistically update the comment regarding emergency disabling via NMI to clarify that virt_rebooting will be set by _another_ emergency callback, i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only before _this_ invocation does VMXOFF. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-02-28perf/amd/ibs: Advertise remote socket capabilityRavi Bangoria
IBS OP on future hardware can indicate data source from remote socket as well. Advertise this capability to userspace so that userspace tools can decode IBS data accordingly. Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260216042530.1546-8-ravi.bangoria@amd.com
2026-02-28perf/amd/ibs: Enable streaming store filterRavi Bangoria
IBS OP on future hardware supports recording samples only for instructions that does streaming store. Like the existing IBS filters, samples pointing to instruction which does not cause streaming store are discarded and IBS restarts internally. Example: $ perf record -e ibs_op/strmst=1/ -- <workload> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260216042530.1546-7-ravi.bangoria@amd.com
2026-02-28perf/amd/ibs: Enable RIP bit63 hardware filteringRavi Bangoria
IBS on future hardware adds the ability to filter IBS events by examining RIP bit 63. Because Linux kernel addresses always have bit 63 set while user-space addresses never do, this capability can be used as a privilege filter. So far, IBS supports privilege filtering in software (swfilt=1), where samples are dropped in the NMI handler. The RIP bit63 hardware filter enables IBS to be usable by unprivileged users without passing swfilt flag. So, swfilt flag will silently be ignored when the hardware filtering capability is present. Example (non-root user): $ perf record -e ibs_op//u -- <workload> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260216042530.1546-6-ravi.bangoria@amd.com
2026-02-28perf/amd/ibs: Enable fetch latency filteringRavi Bangoria
IBS Fetch on future hardware adds fetch latency filtering which generates interrupt only when FetchLat value exceeds a programmable threshold. Hardware allows threshold in 128-cycle increment (i.e. 128, 256, 384 etc.) from 128 to 1920 cycles. Like the existing IBS filters, samples that fail the latency test are dropped and IBS restarts internally. Since hardware supports threshold in multiple of 128, add a software filter on top to support latency threshold with the granularity of 1 cycle in between [128-1920]. Example: # perf record -e ibs_fetch/fetchlat=128/ -c 10000 -a -- sleep 5 Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260216042530.1546-5-ravi.bangoria@amd.com
2026-02-28perf/amd/ibs: Support IBS_{FETCH|OP}_CTL2[Dis] to eliminate RMW raceRavi Bangoria
The existing IBS_{FETCH|OP}_CTL MSRs combine control and status bits which leads to RMW race between HW and SW: HW SW ------------------------ ------------------------------ config = rdmsr(IBS_OP_CTL); config &= ~EN; Set IBS_OP_CTL[Val] to 1 trigger NMI wrmsr(IBS_OP_CTL, config); // Val is accidentally cleared Future hardware adds a control-only MSR, IBS_{FETCH|OP}_CTL2, which provides a second-level "disable" bit (Dis). IBS is now: Enabled: IBS_{FETCH|OP}_CTL[En] = 1 && IBS_{FETCH|OP}_CTL2[Dis] = 0 Disabled: IBS_{FETCH|OP}_CTL[En] = 0 || IBS_{FETCH|OP}_CTL2[Dis] = 1 The separate "Dis" bit lets software disable IBS without touching any status fields, eliminating the hardware/software race. Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260216042530.1546-4-ravi.bangoria@amd.com
2026-02-27perf/amd/ibs: Define macro for ldlat mask and shiftRavi Bangoria
Load latency filter threshold is encoded in config1[11:0]. Define a mask for it instead of hardcoded 0xFFF. Unlike "config" fields whose layout maps to PERF_{FETCH|OP}_CTL MSR, layout of "config1" is custom defined so a new set of macros are needed for "config1" fields. Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260216042530.1546-2-ravi.bangoria@amd.com
2026-02-27perf/amd/ibs: Avoid race between event add and NMIRavi Bangoria
Consider the following race: -------- o OP_CTL contains stale value: OP_CTL[Val]=1, OP_CTL[En]=0 o A new IBS OP event is being added o [P]: Process context, [N]: NMI context [P] perf_ibs_add(event) { [P] if (test_and_set_bit(IBS_ENABLED, pcpu->state)) [P] return; [P] /* pcpu->state = IBS_ENABLED */ [P] [P] pcpu->event = event; [P] [P] perf_ibs_start(event) { [P] set_bit(IBS_STARTED, pcpu->state); [P] /* pcpu->state = IBS_ENABLED | IBS_STARTED */ [P] clear_bit(IBS_STOPPING, pcpu->state); [P] /* pcpu->state = IBS_ENABLED | IBS_STARTED */ [N] --> NMI due to genuine FETCH event. perf_ibs_handle_irq() [N] called for OP PMU as well. [N] [N] perf_ibs_handle_irq(perf_ibs) { [N] event = pcpu->event; /* See line 6 */ [N] [N] if (!test_bit(IBS_STARTED, pcpu->state)) /* false */ [N] return 0; [N] [N] if (WARN_ON_ONCE(!event)) /* false */ [N] goto fail; [N] [N] if (!(*buf++ & perf_ibs->valid_mask)) /* false due to stale [N] * IBS_OP_CTL value */ [N] goto fail; [N] [N] ... [N] [N] perf_ibs_enable_event() // *Accidentally* enable the event. [N] } [N] [N] /* [N] * Repeated NMIs may follow due to accidentally enabled IBS OP [N] * event if the sample period is very low. It could also lead [N] * to pcpu->state corruption if the event gets throttled due [N] * to too frequent NMIs. [N] */ [P] perf_ibs_enable_event(); [P] } [P] } -------- We cannot safely clear IBS_{FETCH|OP}_CTL while disabling the event, because the register might be read again later. So, clear the register in the enable path - before we update pcpu->state and enable the event. This guarantees that any NMI that lands in the gap finds Val=0 and bails out cleanly. Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20260216042216.1440-6-ravi.bangoria@amd.com
2026-02-27perf/amd/ibs: Avoid calling perf_allow_kernel() from the IBS NMI handlerRavi Bangoria
Calling perf_allow_kernel() from the NMI context is unsafe and could be fatal. Capture the permission at event-initialization time by storing it in event->hw.flags, and have the NMI handler rely on that cached flag instead of making the call directly. Fixes: 50a53b60e141d ("perf/amd/ibs: Prevent leaking sensitive data to userspace") Reported-by: Sadasivan Shaiju <sadasivan.shaiju2@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20260216042216.1440-5-ravi.bangoria@amd.com
2026-02-27perf/amd/ibs: Preserve PhyAddrVal bit when clearing PhyAddr MSRRavi Bangoria
Commit 50a53b60e141 ("perf/amd/ibs: Prevent leaking sensitive data to userspace") zeroed the physical address and also cleared the PhyAddrVal flag before copying the value into a perf sample to avoid exposing physical addresses to unprivileged users. Clearing PhyAddrVal, however, has an unintended side-effect: several other IBS fields are considered valid only when this bit is set. As a result, those otherwise correct fields are discarded, reducing IBS functionality. Continue to zero the physical address, but keep the PhyAddrVal bit intact so the related fields remain usable while still preventing any address leak. Fixes: 50a53b60e141 ("perf/amd/ibs: Prevent leaking sensitive data to userspace") Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20260216042216.1440-4-ravi.bangoria@amd.com
2026-02-27perf/amd/ibs: Limit ldlat->l3missonly dependency to Zen5Ravi Bangoria
The ldlat dependency on l3missonly is specific to Zen 5; newer generations are not affected. This quirk is documented as an erratum in the following Revision Guide. Erratum: 1606 IBS (Instruction Based Sampling) OP Load Latency Filtering May Capture Unwanted Samples When L3Miss Filtering is Disabled Revision Guide for AMD Family 1Ah Models 00h-0Fh Processors, Pub. 58251 Rev. 1.30 July 2025 https://bugzilla.kernel.org/attachment.cgi?id=309193 Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20260216042216.1440-3-ravi.bangoria@amd.com
2026-02-27perf/amd/ibs: Account interrupt for discarded samplesRavi Bangoria
Add interrupt throttling accounting for below cases: o IBS Op PMU: A software filter (in addition to the hardware filter) drops samples whose load latency is below the user-specified threshold. o IBS Fetch PMU: Samples discarded due to the zero-RIP erratum (#1197). Although these samples are discarded, the NMI cost is still incurred, so they should be counted for interrupt throttling. Fixes: 26db2e0c51fe83e1dd852c1321407835b481806e ("perf/x86/amd/ibs: Work around erratum #1197") Fixes: d20610c19b4a22bc69085b7eb7a02741d51de30e ("perf/amd/ibs: Add support for OP Load Latency Filtering") Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20260216042216.1440-2-ravi.bangoria@amd.com
2026-02-23perf/x86/intel/uncore: Add per-scheduler IMC CAS count eventsZide Chen
IMC on SPR and EMR does not support sub-channels. In contrast, CPUs that use gnr_uncores[] (e.g. Granite Rapids and Sierra Forest) implement two command schedulers (SCH0/SCH1) per memory channel, providing logically independent command and data paths. Do not reuse the spr_uncore_imc[] configuration for these CPUs. Instead, introduce a dedicated gnr_uncore_imc[] with per-scheduler events, so userspace can monitor SCH0 and SCH1 independently. On these CPUs, replace cas_count_{read,write} with cas_count_{read,write}_sch{0,1}. This may break existing userspace that relies on cas_count_{read,write}, prompting it to switch to the per-scheduler events, as the legacy event reports only partial traffic (SCH0). Fixes: 632c4bf6d007 ("perf/x86/intel/uncore: Support Granite Rapids") Fixes: cb4a6ccf3583 ("perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260210005225.20311-1-zide.chen@intel.com
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-10Merge tag 'perf-core-2026-02-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance event updates from Ingo Molnar: "x86 PMU driver updates: - Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs (Dapeng Mi) Compared to previous iterations of the Intel PMU code, there's been a lot of changes, which center around three main areas: - Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the Off-Core Response (OCR) facility - New PEBS data source encoding layout - Support the new "RDPMC user disable" feature - Likewise, a large series adds uncore PMU support for Intel Diamond Rapids (DMR) CPUs (Zide Chen) This centers around these four main areas: - DMR may have two Integrated I/O and Memory Hub (IMH) dies, separate from the compute tile (CBB) dies. Each CBB and each IMH die has its own discovery domain. - Unlike prior CPUs that retrieve the global discovery table portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON discovery and MSR for CBB PMON discovery. - DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR, PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6. - IIO free-running counters in DMR are MMIO-based, unlike SPR. - Also add support for Add missing PMON units for Intel Panther Lake, and support Nova Lake (NVL), which largely maps to Panther Lake. (Zide Chen) - KVM integration: Add support for mediated vPMUs (by Kan Liang and Sean Christopherson, with fixes and cleanups by Peter Zijlstra, Sandipan Das and Mingwei Zhang) - Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs, which are a low-power variant of Panther Lake (Zide Chen) - Add core, cstate and MSR PMU support for the Airmont NP Intel CPU (aka MaxLinear Lightning Mountain), which maps to the existing Airmont code (Martin Schiller) Performance enhancements: - Speed up kexec shutdown by avoiding unnecessary cross CPU calls (Jan H. Schönherr) - Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim) User-space stack unwinding support: - Various cleanups and refactorings in preparation to generalize the unwinding code for other architectures (Jens Remus) Uprobes updates: - Transition from kmap_atomic to kmap_local_page (Keke Ming) - Fix incorrect lockdep condition in filter_chain() (Breno Leitao) - Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov) Misc fixes and cleanups: - s390: Remove kvm_types.h from Kbuild (Randy Dunlap) - x86/intel/uncore: Convert comma to semicolon (Chen Ni) - x86/uncore: Clean up const mismatch (Greg Kroah-Hartman) - x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)" * tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) s390: remove kvm_types.h from Kbuild uprobes: Fix incorrect lockdep condition in filter_chain() x86/ibs: Fix typo in dc_l2tlb_miss comment x86/uprobes: Fix XOL allocation failure for 32-bit tasks perf/x86/intel/uncore: Convert comma to semicolon perf/x86/intel: Add support for rdpmc user disable feature perf/x86: Use macros to replace magic numbers in attr_rdpmc perf/x86/intel: Add core PMU support for Novalake perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL perf/x86/intel: Add core PMU support for DMR perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL perf/core: Fix slow perf_event_task_exit() with LBR callstacks perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls uprobes: use kmap_local_page() for temporary page mappings arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() perf/x86/intel/uncore: Add Nova Lake support ...
2026-01-21perf/x86/intel: Do not enable BTS for guestsFernand Sieber
By default when users program perf to sample branch instructions (PERF_COUNT_HW_BRANCH_INSTRUCTIONS) with a sample period of 1, perf interprets this as a special case and enables BTS (Branch Trace Store) as an optimization to avoid taking an interrupt on every branch. Since BTS doesn't virtualize, this optimization doesn't make sense when the request originates from a guest. Add an additional check that prevents this optimization for virtualized events (exclude_host). Reported-by: Jan H. Schönherr <jschoenh@amazon.de> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://patch.msgid.link/20251211183604.868641-1-sieberf@amazon.com
2026-01-15perf/x86/intel/uncore: Convert comma to semicolonChen Ni
Replace comma between expressions with semicolons. Using a ',' in place of a ';' can have unintended side effects. Although that is not the case here, it is seems best to use ';' unless ',' is intended. Found by inspection. No functional change intended. Compile tested only. Fixes: e7d5f2ea0923 ("perf/x86/intel/uncore: Add Nova Lake support") Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://patch.msgid.link/20260114023652.3926117-1-nichen@iscas.ac.cn
2026-01-15perf/x86/intel: Add support for rdpmc user disable featureDapeng Mi
Starting with Panther Cove, the rdpmc user disable feature is supported. This feature allows the perf system to disable user space rdpmc reads at the counter level. Currently, when a global counter is active, any user with rdpmc rights can read it, even if perf access permissions forbid it (e.g., disallow reading ring 0 counters). The rdpmc user disable feature mitigates this security concern. Details: - A new RDPMC_USR_DISABLE bit (bit 37) in each EVNTSELx MSR indicates that the GP counter cannot be read by RDPMC in ring 3. - New RDPMC_USR_DISABLE bits in IA32_FIXED_CTR_CTRL MSR (bits 33, 37, 41, 45, etc.) for fixed counters 0, 1, 2, 3, etc. - When calling rdpmc instruction for counter x, the following pseudo code demonstrates how the counter value is obtained: If (!CPL0 && RDPMC_USR_DISABLE[x] == 1) ? 0 : counter_value; - RDPMC_USR_DISABLE is enumerated by CPUID.0x23.0.EBX[2]. This patch extends the current global user space rdpmc control logic via the sysfs interface (/sys/devices/cpu/rdpmc) as follows: - rdpmc = 0: Global user space rdpmc and counter-level user space rdpmc for all counters are both disabled. - rdpmc = 1: Global user space rdpmc is enabled during the mmap-enabled time window, and counter-level user space rdpmc is enabled only for non-system-wide events. This prevents counter data leaks as count data is cleared during context switches. - rdpmc = 2: Global user space rdpmc and counter-level user space rdpmc for all counters are enabled unconditionally. The new rdpmc settings only affect newly activated perf events; currently active perf events remain unaffected. This simplifies and cleans up the code. The default value of rdpmc remains unchanged at 1. For more details about rdpmc user disable, please refer to chapter 15 "RDPMC USER DISABLE" in ISE documentation. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-8-dapeng1.mi@linux.intel.com
2026-01-15perf/x86: Use macros to replace magic numbers in attr_rdpmcDapeng Mi
Replace magic numbers in attr_rdpmc with macros to improve readability and make their meanings clearer for users. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-7-dapeng1.mi@linux.intel.com
2026-01-15perf/x86/intel: Add core PMU support for NovalakeDapeng Mi
This patch enables core PMU support for Novalake, covering both P-core and E-core. It includes Arctic Wolf-specific counters and PEBS constraints, and the model-specific OMR extra registers table. Since Coyote Cove shares the same PMU capabilities as Panther Cove, the existing Panther Cove PMU enabling functions are reused for Coyote Cove. For detailed information about counter constraints, please refer to section 16.3 "COUNTER RESTRICTIONS" in the ISE documentation. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-6-dapeng1.mi@linux.intel.com
2026-01-15perf/x86/intel: Add support for PEBS memory auxiliary info field in NVLDapeng Mi
Similar to DMR (Panther Cove uarch), both P-core (Coyote Cove uarch) and E-core (Arctic Wolf uarch) of NVL adopt the new PEBS memory auxiliary info layout. Coyote Cove microarchitecture shares the same PMU capabilities, including the memory auxiliary info layout, with Panther Cove. Arctic Wolf microarchitecture has a similar layout to Panther Cove, with the only difference being specific data source encoding for L2 hit cases (up to the L2 cache level). The OMR encoding remains the same as in Panther Cove. For detailed information on the memory auxiliary info encoding, please refer to section 16.2 "PEBS LOAD LATENCY AND STORE LATENCY FACILITY" in the latest ISE documentation. This patch defines Arctic Wolf specific data source encoding and then supports PEBS memory auxiliary info field for NVL. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-5-dapeng1.mi@linux.intel.com
2026-01-15perf/x86/intel: Add core PMU support for DMRDapeng Mi
This patch enables core PMU features for Diamond Rapids (Panther Cove microarchitecture), including Panther Cove specific counter and PEBS constraints, a new cache events ID table, and the model-specific OMR events extra registers table. For detailed information about counter constraints, please refer to section 16.3 "COUNTER RESTRICTIONS" in the ISE documentation. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-4-dapeng1.mi@linux.intel.com
2026-01-15perf/x86/intel: Add support for PEBS memory auxiliary info field in DMRDapeng Mi
With the introduction of the OMR feature, the PEBS memory auxiliary info field for load and store latency events has been restructured for DMR. The memory auxiliary info field's bit[8] indicates whether a L2 cache miss occurred for a memory load or store instruction. If bit[8] is 0, it signifies no L2 cache miss, and bits[7:0] specify the exact cache data source (up to the L2 cache level). If bit[8] is 1, bits[7:0] represent the OMR encoding, indicating the specific L3 cache or memory region involved in the memory access. A significant enhancement is OMR encoding provides up to 8 fine-grained memory regions besides the cache region. A significant enhancement for OMR encoding is the ability to provide up to 8 fine-grained memory regions in addition to the cache region, offering more detailed insights into memory access regions. For detailed information on the memory auxiliary info encoding, please refer to section 16.2 "PEBS LOAD LATENCY AND STORE LATENCY FACILITY" in the ISE documentation. This patch ensures that the PEBS memory auxiliary info field is correctly interpreted and utilized in DMR. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-3-dapeng1.mi@linux.intel.com
2026-01-15perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVLDapeng Mi
Diamond Rapids (DMR) and Nova Lake (NVL) introduce an enhanced Off-Module Response (OMR) facility, replacing the Off-Core Response (OCR) Performance Monitoring of previous processors. Legacy microarchitectures used the OCR facility to evaluate off-core and multi-core off-module transactions. The newly named OMR facility improves OCR capabilities for scalable coverage of new memory systems in multi-core module systems. Similar to OCR, 4 additional off-module configuration MSRs (OFFMODULE_RSP_0 to OFFMODULE_RSP_3) are introduced to specify attributes of off-module transactions. When multiple identical OMR events are created, they need to occupy the same OFFMODULE_RSP_x MSR. To ensure these multiple identical OMR events can work simultaneously, the intel_alt_er() and intel_fixup_er() helpers are enhanced to rotate these OMR events across different OFFMODULE_RSP_* MSRs, similar to previous OCR events. For more details about OMR, please refer to section 16.1 "OFF-MODULE RESPONSE (OMR) FACILITY" in ISE documentation. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260114011750.350569-2-dapeng1.mi@linux.intel.com
2026-01-11treewide: Update email addressThomas Gleixner
In a vain attempt to consolidate the email zoo switch everything to the kernel.org account. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>