perf tools changes for v6.3:

- 'perf lock contention' improvements:

  - Add -o/--lock-owner option:

  $ sudo ./perf lock contention -abo -- ./perf bench sched pipe
  # Running 'sched/pipe' benchmark:
  # Executed 1000000 pipe operations between two processes

       Total time: 4.766 [sec]

         4.766540 usecs/op
           209795 ops/sec
   contended   total wait     max wait     avg wait          pid   owner

         403    565.32 us     26.81 us      1.40 us           -1   Unknown
           4     27.99 us      8.57 us      7.00 us      1583145   sched-pipe
           1      8.25 us      8.25 us      8.25 us      1583144   sched-pipe
           1      2.03 us      2.03 us      2.03 us         5068   chrome

   The owner is unknown in most cases.  Filtering for mutex locks only makes
   it more likely that the owners can be identified.

  - The -S/--callstack-filter option limits the displayed entries to those
   having the given string in the callstack:

  $ sudo ./perf lock contention -abv -S net sleep 1
  ...
   contended   total wait     max wait     avg wait         type   caller

           5     70.20 us     16.13 us     14.04 us     spinlock   __dev_queue_xmit+0xb6d
                          0xffffffffa5dd1c60  _raw_spin_lock+0x30
                          0xffffffffa5b8f6ed  __dev_queue_xmit+0xb6d
                          0xffffffffa5cd8267  ip6_finish_output2+0x2c7
                          0xffffffffa5cdac14  ip6_finish_output+0x1d4
                          0xffffffffa5cdb477  ip6_xmit+0x457
                          0xffffffffa5d1fd17  inet6_csk_xmit+0xd7
                          0xffffffffa5c5f4aa  __tcp_transmit_skb+0x54a
                          0xffffffffa5c6467d  tcp_keepalive_timer+0x2fd

  Note that for the -b option (BPF) used above to work, perf has to be built
  with BUILD_BPF_SKEL=1.

  - Add more 'perf test' entries to test these new features.

- Add Ian Rogers to MAINTAINERS as a perf tools reviewer.

- Add support for retire latency feature (pipeline stall of an instruction
  compared to the previous one, in cycles) present on some Intel processors.

- Add 'perf c2c' report option to show false sharing with adjacent cachelines, to
  be used on machines with cacheline prefetching, where an access to a cacheline
  brings in the next one too.

- Skip 'perf test bpf' when the required kernel-debuginfo package isn't installed.

perf script:

- Add 'cgroup' field for 'perf script' output:

  $ perf record --all-cgroups -- true
  $ perf script -F comm,pid,cgroup
            true 337112  /user.slice/user-657345.slice/user@657345.service/...
            true 337112  /user.slice/user-657345.slice/user@657345.service/...
            true 337112  /user.slice/user-657345.slice/user@657345.service/...
            true 337112  /user.slice/user-657345.slice/user@657345.service/...

- Add support for showing branch speculation information in 'perf
  script' and in the 'perf report' raw dump (-D).

perf record:

- Fix 'perf record' segfault with --overwrite and --max-size.

Intel PT:

- Add support for synthesizing "cycle" events from Intel PT traces, as is
  already done for "instruction" events, when Intel PT CYC packets are
  available. This enables much more accurate profiles than the regular 'perf
  record -e cycles' (the default) when the workload runs for very short
  periods (<10ms).

- .plt symbol handling improvements, done in the context of decoding Intel
  PT processor traces: better handling of IBT (in the past MPX), IFUNC
  symbols on x86_64, static executables, and .plt.got symbols on x86_64.

- Add a 'perf test' entry for symbol resolution, part of the .plt
  improvements series. It tests things like symbol size in contexts
  where only the symbol start is available (kallsyms), etc.

- Better handle auxtrace/Intel PT data when using pipe mode (perf record sleep 1|perf report).

- Fix symbol lookup with kcore when multiple segments match stext, which
  caused the symbol resolution to just show DSOs as unknown.

ARM:

- Timestamp improvements for ARM64 systems with ETMv4 (Embedded Trace
  Macrocell v4).

- Ensure ARM64 CoreSight timestamps don't go backwards.

- Document that ARM64 SPE (Statistical Profiling Extension) is used with 'perf c2c/mem'.

- Add raw decoding for ARM64 SPEv1.2 previous branch address.

- Update neoverse-n2-v2 ARM vendor events (JSON tables): topdown L1, TLB,
  cache, branch, PE utilization and instruction mix metrics.

- Update decoder code for OpenCSD version 1.4, on ARM64 systems.

- Fix command line auto-complete of CPU events on aarch64.

perf test/bench:

- Switch basic BPF filtering test to use a syscall tracepoint, to avoid the
  variable number of probes that get inserted on different CPU architectures
  when using the previous probe point (do_epoll_wait).

- Fix DWARF unwind test by adding non-inline to a function expected in the
  backtrace.

- Use 'grep -c' where the longer form 'grep | wc -l' was being used.

- Add getpid and execve benchmarks to 'perf bench syscall'.
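
  The 'grep -c' item above replaces a two-process pipeline with a single
  command. A small sketch with made-up input (the log text is illustrative
  only, not real 'perf test' output):

```shell
# Made-up sample input: three lines, two of which contain "FAILED".
log='test 1: Ok
test 2: FAILED
test 3: FAILED'

# Longer form: grep filters the lines, wc counts them.
long_form=$(printf '%s\n' "$log" | grep FAILED | wc -l)

# Shorter form: grep counts the matching lines itself.
short_form=$(printf '%s\n' "$log" | grep -c FAILED)

echo "long=$long_form short=$short_form"
```

  One difference to keep in mind: with no matches, 'grep -c' prints 0 but
  exits with status 1, whereas the pipeline by default returns the exit
  status of 'wc'.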

Miscellaneous:

- Avoid d3-flame-graph package dependency in 'perf script flamegraph',
  making this feature more generally available.

- Add JSON metric events to present CPI stall cycles in Power10.

- Assorted improvements/refactorings on the JSON metrics parsing code.

Build:

- Fix 'perf probe' and 'perf test' when libtraceevent isn't linked: several
  tests use tracepoints, and those should be skipped.

- More fallout fixes for the removal of tools/lib/traceevent/.

- Fix build error when linking with libpfm.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf tests stat_all_metrics: Change true workload to sleep workload for system wide check

Testcase stat_all_metrics.sh fails on powerpc:

98: perf all metrics test : FAILED!

Logs with verbose:

  [command]# ./perf test 98 -vv
   98: perf all metrics test                                           :
   --- start ---
  test child forked, pid 13262
  Testing BRU_STALL_CPI
  Testing COMPLETION_STALL_CPI
   ----
  Testing TOTAL_LOCAL_NODE_PUMPS_P23
  Metric 'TOTAL_LOCAL_NODE_PUMPS_P23' not printed in:
  Error:
  Invalid event (hv_24x7/PM_PB_LNS_PUMP23,chip=3/) in per-thread mode, enable system wide with '-a'.
  Testing TOTAL_LOCAL_NODE_PUMPS_RETRIES_P01
  Metric 'TOTAL_LOCAL_NODE_PUMPS_RETRIES_P01' not printed in:
  Error:
  Invalid event (hv_24x7/PM_PB_RTY_LNS_PUMP01,chip=3/) in per-thread mode, enable system wide with '-a'.
   ----

Based on the above logs, some of the hv-24x7 metric events fail, and the
logs suggest running the metric event with the -a option.  This change
happened after commit a4b8cfcabb1d90ec ("perf stat: Delay metric
parsing"), which delayed the metric parsing phase, so the perf tool now
identifies whether the target is system-wide or not before metric
parsing. With this change, perf_event_open fails with workload
monitoring for uncore events, as expected.

The 'perf all metrics test' case fails because some of the hv-24x7 metric
events may need a bigger workload with system-wide monitoring to get the
data. Fix this by changing the workload of the current system-wide check
from 'true' to 'sleep 0.01'.
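
The fallback described above can be sketched as a small shell function.
This is a minimal illustration, not the actual stat_all_metrics.sh code:
the check_metric name and the PERF variable (used to swap in a different
perf binary) are assumptions made for this sketch.

```shell
# Hypothetical helper: check one metric per-thread first, then retry
# system wide with a workload that runs long enough to collect data.
check_metric() {
  metric="$1"

  # First attempt: per-thread monitoring of a trivial workload.
  output=$("${PERF:-perf}" stat -M "$metric" true 2>&1)
  case "$output" in
    *"$metric"*) return 0 ;;
  esac

  # Second attempt: system wide (-a), with 'sleep 0.01' rather than
  # 'true' as the workload.
  output=$("${PERF:-perf}" stat -M "$metric" -a sleep 0.01 2>&1)
  case "$output" in
    *"$metric"*) return 0 ;;
  esac

  echo "Metric '$metric' not printed in:"
  echo "$output"
  return 1
}
```

Calling 'check_metric TOTAL_LOCAL_NODE_PUMPS_P23' would then report a
failure only if the metric name is missing from both outputs.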

Result with the patch applied on powerpc:

  98: perf all metrics test : Ok

Fixes: a4b8cfcabb1d90ec ("perf stat: Delay metric parsing")
Suggested-by: Ian Rogers <irogers@google.com>
Reviewed-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Tested-by: Disha Goel <disgoel@linux.ibm.com>
Tested-by: Ian Rogers <irogers@google.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nageswara R Sastry <rnsastry@linux.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lore.kernel.org/r/20230215093827.124921-1-kjain@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
1 file changed