linux.git/tools/perf/Makefile.perf, branch v5.13

perf tools: Enable libtraceevent dynamic linking

2021-04-29T13:31:00+00:00

Currently we support only static linking with kernel's libtraceevent
(tools/lib/traceevent). This patch adds libtraceevent package detection
and support to link perf with it dynamically.

  The libtraceevent package status is displayed with:
  $ make VF=1 LIBTRACEEVENT_DYNAMIC=1
  ...
  ...                 libtraceevent: [ on  ]

Default behavior remains the same (static linking).

Committer testing:

  $ make LIBTRACEEVENT_DYNAMIC=1 VF=1 O=/tmp/build/perf -C tools/perf install-bin |& grep traceevent
  Makefile.config:1090: *** Error: No libtraceevent devel library found, please install libtraceevent-devel.  Stop.
  $

Signed-off-by: Michael Petlan 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
LPU-Reference: 20210428092023.4009-1-mpetlan@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo

perf stat: Enable iostat mode for x86 platforms

2021-04-20T11:40:20+00:00

This functionality is based on recently introduced sysfs attributes for
Intel® Xeon® Scalable processor family (code name Skylake-SP):

Commit bb42b3d39781d7fc ("perf/x86/intel/uncore: Expose an Uncore unit to IIO PMON mapping")

Mode is intended to provide four I/O performance metrics in MB per each
PCIe root port:

 - Inbound Read: I/O devices below root port read from the host memory
 - Inbound Write: I/O devices below root port write to the host memory
 - Outbound Read: CPU reads from I/O devices below root port
 - Outbound Write: CPU writes to I/O devices below root port

Each metric requiries only one uncore event which increments at every 4B
transfer in corresponding direction. The formulas to compute metrics
are generic:
    #EventCount * 4B / (1024 * 1024)

Acked-by: Namhyung Kim 
Signed-off-by: Alexander Antonov 
Cc: Alexander Shishkin 
Cc: Alexey V Bayduraev 
Cc: Andi Kleen 
Cc: Arnaldo Carvalho de Melo 
Cc: Ian Rogers 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Mark Rutland 
Cc: Peter Zijlstra 
Link: https://lore.kernel.org/r/20210419094147.15909-4-alexander.antonov@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo

perf stat: Introduce 'bperf' to share hardware PMCs with BPF

2021-03-23T20:46:44+00:00

The perf tool uses performance monitoring counters (PMCs) to monitor
system performance. The PMCs are limited hardware resources. For
example, Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways: system
level monitoring, (maybe nested) container level monitoring, per process
monitoring, profiling (in sample mode), etc. In some cases, there are
more active perf_events than available hardware PMCs. To allow all
perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics
(cycles, instructions). It is a waste to have multiple tools create
multiple perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The
default path is /sys/fs/bpf/perf_attr_map. The user could change the
path with option --bpf-attr-map.

Committer testing:

  # dmesg|grep "Performance Events" -A5
  [    0.225277] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
  [    0.225280] ... version:                0
  [    0.225280] ... bit width:              48
  [    0.225281] ... generic registers:      6
  [    0.225281] ... value mask:             0000ffffffffffff
  [    0.225281] ... max period:             00007fffffffffff
  #
  #  for a in $(seq 6) ; do perf stat -a -e cycles,instructions sleep 100000 & done
  [1] 2436231
  [2] 2436232
  [3] 2436233
  [4] 2436234
  [5] 2436235
  [6] 2436236
  # perf stat -a -e cycles,instructions sleep 0.1

   Performance counter stats for 'system wide':

         310,326,987      cycles                                                        (41.87%)
         236,143,290      instructions              #    0.76  insn per cycle           (41.87%)

         0.100800885 seconds time elapsed

  #

We can see that the counters were enabled for this workload 41.87% of
the time.

Now with --bpf-counters:

  #  for a in $(seq 32) ; do perf stat --bpf-counters -a -e cycles,instructions sleep 100000 & done
  [1] 2436514
  [2] 2436515
  [3] 2436516
  [4] 2436517
  [5] 2436518
  [6] 2436519
  [7] 2436520
  [8] 2436521
  [9] 2436522
  [10] 2436523
  [11] 2436524
  [12] 2436525
  [13] 2436526
  [14] 2436527
  [15] 2436528
  [16] 2436529
  [17] 2436530
  [18] 2436531
  [19] 2436532
  [20] 2436533
  [21] 2436534
  [22] 2436535
  [23] 2436536
  [24] 2436537
  [25] 2436538
  [26] 2436539
  [27] 2436540
  [28] 2436541
  [29] 2436542
  [30] 2436543
  [31] 2436544
  [32] 2436545
  #
  # ls -la /sys/fs/bpf/perf_attr_map
  -rw-------. 1 root root 0 Mar 23 14:53 /sys/fs/bpf/perf_attr_map
  # bpftool map | grep bperf | wc -l
  64
  #

  # bpftool map | tail
  1265: percpu_array  name accum_readings  flags 0x0
  	key 4B  value 24B  max_entries 1  memlock 4096B
  1266: hash  name filter  flags 0x0
  	key 4B  value 4B  max_entries 1  memlock 4096B
  1267: array  name bperf_fo.bss  flags 0x400
  	key 4B  value 8B  max_entries 1  memlock 4096B
  	btf_id 996
  	pids perf(2436545)
  1268: percpu_array  name accum_readings  flags 0x0
  	key 4B  value 24B  max_entries 1  memlock 4096B
  1269: hash  name filter  flags 0x0
  	key 4B  value 4B  max_entries 1  memlock 4096B
  1270: array  name bperf_fo.bss  flags 0x400
  	key 4B  value 8B  max_entries 1  memlock 4096B
  	btf_id 997
  	pids perf(2436541)
  1285: array  name pid_iter.rodata  flags 0x480
  	key 4B  value 4B  max_entries 1  memlock 4096B
  	btf_id 1017  frozen
  	pids bpftool(2437504)
  1286: array  flags 0x0
  	key 4B  value 32B  max_entries 1  memlock 4096B
  #
  # bpftool map dump id 1268 | tail
  value (CPU 21):
  8f f3 bc ca 00 00 00 00  80 fd 2a d1 4d 00 00 00
  80 fd 2a d1 4d 00 00 00
  value (CPU 22):
  7e d5 64 4d 00 00 00 00  a4 8a 2e ee 4d 00 00 00
  a4 8a 2e ee 4d 00 00 00
  value (CPU 23):
  a7 78 3e 06 01 00 00 00  b2 34 94 f6 4d 00 00 00
  b2 34 94 f6 4d 00 00 00
  Found 1 element
  # bpftool map dump id 1268 | tail
  value (CPU 21):
  c6 8b d9 ca 00 00 00 00  20 c6 fc 83 4e 00 00 00
  20 c6 fc 83 4e 00 00 00
  value (CPU 22):
  9c b4 d2 4d 00 00 00 00  3e 0c df 89 4e 00 00 00
  3e 0c df 89 4e 00 00 00
  value (CPU 23):
  18 43 66 06 01 00 00 00  5b 69 ed 83 4e 00 00 00
  5b 69 ed 83 4e 00 00 00
  Found 1 element
  # bpftool map dump id 1268 | tail
  value (CPU 21):
  f2 6e db ca 00 00 00 00  92 67 4c ba 4e 00 00 00
  92 67 4c ba 4e 00 00 00
  value (CPU 22):
  dc 8e e1 4d 00 00 00 00  d9 32 7a c5 4e 00 00 00
  d9 32 7a c5 4e 00 00 00
  value (CPU 23):
  bd 2b 73 06 01 00 00 00  7c 73 87 bf 4e 00 00 00
  7c 73 87 bf 4e 00 00 00
  Found 1 element
  #

  # perf stat --bpf-counters -a -e cycles,instructions sleep 0.1

   Performance counter stats for 'system wide':

       119,410,122      cycles
       152,105,479      instructions              #    1.27  insn per cycle

       0.101395093 seconds time elapsed

  #

See? We had the counters enabled all the time.

Signed-off-by: Song Liu 
Reviewed-by: Jiri Olsa 
Acked-by: Namhyung Kim 
Tested-by: Arnaldo Carvalho de Melo 
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20210316211837.910506-2-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo

Merge remote-tracking branch 'torvalds/master' into perf/core

2021-03-08T13:11:33+00:00

To pick up the fixes sent for v5.12 and continue development based on
v5.12-rc2, i.e. without the swap on file bug.

This also gets a slightly newer and better tools/perf/arch/arm/util/cs-etm.c
patch version, using the BIT() macro, that had already been slated to
v5.13 but ended up going to v5.12-rc1 on an older version.

Signed-off-by: Arnaldo Carvalho de Melo

perf build: Fix ccache usage in $(CC) when generating arch errno table

2021-03-06T19:54:26+00:00

This was introduced by commit e4ffd066ff440a57 ("perf: Normalize gcc
parameter when generating arch errno table").

Assuming the first word of $(CC) is the actual compiler breaks usage
like CC="ccache gcc": the script ends up calling ccache directly with
gcc arguments, what fails. Instead of getting the first word, just
remove from $(CC) any word that starts with a "-". This maintains the
spirit of the original patch, while not breaking ccache users.

Fixes: e4ffd066ff440a57 ("perf: Normalize gcc parameter when generating arch errno table")
Signed-off-by: Antonio Terceiro 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: He Zhe 
Cc: Jiri Olsa 
Cc: Mark Rutland 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: stable@vger.kernel.org
Link: http://lore.kernel.org/lkml/20210224130046.346977-1-antonio.terceiro@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo

perf build: Move feature cleanup under tools/build

2021-03-06T19:54:24+00:00

Arnaldo reported issue for following build command:

  $ rm -rf /tmp/krava; mkdir /tmp/krava; make O=/tmp/krava clean
    CLEAN    config
  /bin/sh: line 0: cd: /tmp/krava/feature/: No such file or directory
  ../../scripts/Makefile.include:17: *** output directory "/tmp/krava/feature/" does not exist.  Stop.
  make[1]: *** [Makefile.perf:1010: config-clean] Error 2
  make: *** [Makefile:90: clean] Error 2

The problem is that now that we include scripts/Makefile.include
in feature's Makefile (which is fine and needed), we need to ensure
the OUTPUT directory exists, before executing (out of tree) clean
command.

Removing the feature's cleanup from perf Makefile and fixing
feature's cleanup under build Makefile, so it now checks that
there's existing OUTPUT directory before calling the clean.

Fixes: 211a741cd3e1 ("tools: Factor Clang, LLC and LLVM utils definitions")
Reported-by: Arnaldo Carvalho de Melo 
Signed-off-by: Jiri Olsa 
Tested-by: Arnaldo Carvalho de Melo 
Tested-by: Sedat Dilek  # LLVM/Clang v13-git
Cc: Alexander Shishkin 
Cc: Ian Rogers 
Cc: Mark Rutland 
Cc: Michael Petlan 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Link: http://lore.kernel.org/lkml/20210224150831.409639-1-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo

perf tools: Enable warnings when compiling BPF programs

2021-03-06T19:42:16+00:00

Add -Wall -Werror when compiling BPF skeletons.

Signed-off-by: Ian Rogers 
Acked-by: Song Liu 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Mark Rutland 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Link: http://lore.kernel.org/lkml/20210306080840.3785816-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo

Merge tag 'perf-tools-for-v5.12-2020-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

2021-02-22T21:59:43+00:00

Pull perf tool updates from Arnaldo Carvalho de Melo:
"New features:

- Support instruction latency in 'perf report', with both memory
latency (weight) and instruction latency information, users can
locate expensive load instructions and understand time spent in
different stages.

- Extend 'perf c2c' to display the number of loads which were blocked
by data or address conflict.

- Add 'perf stat' support for L2 topdown events in systems such as
Intel's Sapphire rapids server.

- Add support for PERF_SAMPLE_CODE_PAGE_SIZE in various tools, as a
sort key, for instance:

perf report --stdio --sort=comm,symbol,code_page_size

- New 'perf daemon' command to run long running sessions while
providing a way to control the enablement of events without
restarting a traditional 'perf record' session.

- Enable counting events for BPF programs in 'perf stat' just like
for other targets (tid, cgroup, cpu, etc), e.g.:

# perf stat -e ref-cycles,cycles -b 254 -I 1000
1.487903822 115,200 ref-cycles
1.487903822 86,012 cycles
2.489147029 80,560 ref-cycles
2.489147029 73,784 cycles
^C

The example above counts 'cycles' and 'ref-cycles' of BPF program
of id 254. It is similar to bpftool-prog-profile command, but more
flexible.

- Support the new layout for PERF_RECORD_MMAP2 to carry the DSO
build-id using infrastructure generalised from the eBPF subsystem,
removing the need for traversing the perf.data file to collect
build-ids at the end of 'perf record' sessions and helping with
long running sessions where binaries can get replaced in updates,
leading to possible mis-resolution of symbols.

- Support filtering by hex address in 'perf script'.

- Support DSO filter in 'perf script', like in other perf tools.

- Add namespaces support to 'perf inject'

- Add support for SDT (Dtrace Style Markers) events on ARM64.

perf record:

- Fix handling of eventfd() when draining a buffer in 'perf record'.

- Improvements to the generation of metadata events for pre-existing
threads (mmaps, comm, etc), speeding up the work done at the start
of system wide or per CPU 'perf record' sessions.

Hardware tracing:

- Initial support for tracing KVM with Intel PT.

- Intel PT fixes for IPC

- Support Intel PT PSB (synchronization packets) events.

- Automatically group aux-output events to overcome --filter syntax.

- Enable PERF_SAMPLE_DATA_SRC on ARMs SPE.

- Update ARM's CoreSight hardware tracing OpenCSD library to v1.0.0.

perf annotate TUI:

- Fix handling of 'k' ("show line number") hotkey

- Fix jump parsing for C++ code.

perf probe:

- Add protection to avoid endless loop.

cgroups:

- Avoid reading cgroup mountpoint multiple times, caching it.

- Fix handling of cgroup v1/v2 in mixed hierarchy.

Symbol resolving:

- Add OCaml symbol demangling.

- Further fixes for handling PE executables when using perf with Wine
and .exe/.dll files.

- Fix 'perf unwind' DSO handling.

- Resolve symbols against debug file first, to deal with artifacts
related to LTO.

- Fix gap between kernel end and module start on powerpc.

Reporting tools:

- The DSO filter shouldn't show samples in unresolved maps.

- Improve debuginfod support in various tools.

build ids:

- Fix 16-byte build ids in 'perf buildid-cache', add a 'perf test'
entry for that case.

perf test:

- Support for PERF_SAMPLE_WEIGHT_STRUCT.

- Add test case for PERF_SAMPLE_CODE_PAGE_SIZE.

- Shell based tests for 'perf daemon's commands ('start', 'stop,
'reconfig', 'list', etc).

- ARM cs-etm 'perf test' fixes.

- Add parse-metric memory bandwidth testcase.

Compiler related:

- Fix 'perf probe' kretprobe issue caused by gcc 11 bug when used
with -fpatchable-function-entry.

- Fix ARM64 build with gcc 11's -Wformat-overflow.

- Fix unaligned access in sample parsing test.

- Fix printf conversion specifier for IP addresses on arm64, s390 and
powerpc.

Arch specific:

- Support exposing Performance Monitor Counter SPRs as part of
extended regs on powerpc.

- Add JSON 'perf stat' metrics for ARM64's imx8mp, imx8mq and imx8mn
DDR, fix imx8mm ones.

- Fix common and uarch events for ARM64's A76 and Ampere eMag"

* tag 'perf-tools-for-v5.12-2020-02-19' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (148 commits)
perf buildid-cache: Don't skip 16-byte build-ids
perf buildid-cache: Add test for 16-byte build-id
perf symbol: Remove redundant libbfd checks
perf test: Output the sub testing result in cs-etm
perf test: Suppress logs in cs-etm testing
perf tools: Fix arm64 build error with gcc-11
perf intel-pt: Add documentation for tracing virtual machines
perf intel-pt: Split VM-Entry and VM-Exit branches
perf intel-pt: Adjust sample flags for VM-Exit
perf intel-pt: Allow for a guest kernel address filter
perf intel-pt: Support decoding of guest kernel
perf machine: Factor out machine__idle_thread()
perf machine: Factor out machines__find_guest()
perf intel-pt: Amend decoder to track the NR flag
perf intel-pt: Retain the last PIP packet payload as is
perf intel_pt: Add vmlaunch and vmresume as branches
perf script: Add branch types for VM-Entry and VM-Exit
perf auxtrace: Automatically group aux-output events
perf test: Fix unaligned access in sample parsing test
perf tools: Support arch specific PERF_SAMPLE_WEIGHT_STRUCT processing
...

tools: Factor Clang, LLC and LLVM utils definitions

2021-01-29T00:25:34+00:00

When dealing with BPF/BTF/pahole and DWARF v5 I wanted to build bpftool.

While looking into the source code I found duplicate assignments in misc tools
for the LLVM eco system, e.g. clang and llvm-objcopy.

Move the Clang, LLC and/or LLVM utils definitions to tools/scripts/Makefile.include
file and add missing includes where needed. Honestly, I was inspired by the commit
c8a950d0d3b9 ("tools: Factor HOSTCC, HOSTLD, HOSTAR definitions").

I tested with bpftool and perf on Debian/testing AMD64 and LLVM/Clang v11.1.0-rc1.

Build instructions:

[ make and make-options ]
MAKE="make V=1"
MAKE_OPTS="HOSTCC=clang HOSTCXX=clang++ HOSTLD=ld.lld CC=clang LD=ld.lld LLVM=1 LLVM_IAS=1"
MAKE_OPTS="$MAKE_OPTS PAHOLE=/opt/pahole/bin/pahole"

[ clean-up ]
$MAKE $MAKE_OPTS -C tools/ clean

[ bpftool ]
$MAKE $MAKE_OPTS -C tools/bpf/bpftool/

[ perf ]
PYTHON=python3 $MAKE $MAKE_OPTS -C tools/perf/

I was careful with respecting the user's wish to override custom compiler, linker,
GNU/binutils and/or LLVM utils settings.

Signed-off-by: Sedat Dilek 
Signed-off-by: Daniel Borkmann 
Acked-by: Andrii Nakryiko 
Acked-by: Jiri Olsa  # tools/build and tools/perf
Link: https://lore.kernel.org/bpf/20210128015117.20515-1-sedat.dilek@gmail.com

perf stat: Enable counting events for BPF programs

2021-01-20T17:25:28+00:00

Introduce 'perf stat -b' option, which counts events for BPF programs, like:

  [root@localhost ~]# ~/perf stat -e ref-cycles,cycles -b 254 -I 1000
     1.487903822            115,200      ref-cycles
     1.487903822             86,012      cycles
     2.489147029             80,560      ref-cycles
     2.489147029             73,784      cycles
     3.490341825             60,720      ref-cycles
     3.490341825             37,797      cycles
     4.491540887             37,120      ref-cycles
     4.491540887             31,963      cycles

The example above counts 'cycles' and 'ref-cycles' of BPF program of id
254.  This is similar to bpftool-prog-profile command, but more
flexible.

'perf stat -b' creates per-cpu perf_event and loads fentry/fexit BPF
programs (monitor-progs) to the target BPF program (target-prog). The
monitor-progs read perf_event before and after the target-prog, and
aggregate the difference in a BPF map. Then the user space reads data
from these maps.

A new 'struct bpf_counter' is introduced to provide a common interface
that uses BPF programs/maps to count perf events.

Committer notes:

Removed all but bpf_counter.h includes from evsel.h, not needed at all.

Also BPF map lookups for PERCPU_ARRAYs need to have as its value receive
buffer passed to the kernel libbpf_num_possible_cpus() entries, not
evsel__nr_cpus(evsel), as the former uses
/sys/devices/system/cpu/possible while the later uses
/sys/devices/system/cpu/online, which may be less than the 'possible'
number making the bpf map lookup overwrite memory and cause hard to
debug memory corruption.

We need to continue using evsel__nr_cpus(evsel) when accessing the
perf_counts array tho, not to overwrite another are of memory :-)

Signed-off-by: Song Liu 
Tested-by: Arnaldo Carvalho de Melo 
Link: https://lore.kernel.org/lkml/20210120163031.GU12699@kernel.org/
Acked-by: Namhyung Kim 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Mark Rutland 
Cc: Peter Zijlstra 
Cc: kernel-team@fb.com
Link: http://lore.kernel.org/lkml/20201229214214.3413833-4-songliubraving@fb.com
Signed-off-by: Arnaldo Carvalho de Melo