perf tools changes and fixes for v6.5: 1st batch

Internal cleanup:

 - Refactor PMU data management to handle hybrid systems in a generic way.
   Do more work in the lexer so that legacy event types parse more easily.
   A side-effect of this is that if a PMU is specified, scanning sysfs is
   avoided, improving start-up time.

 - Fix hybrid metrics.  For example, TopdownL1 now works for both performance
   and efficiency cores on Intel machines.  To support this, events are
   sorted and regrouped after parsing.

 - Add reference count checking for the 'thread' data structure.

 - Lots of fixes for memory leaks in various places thanks to the ASAN and
   Ian's refcount checker.

 - Reduce the binary size by replacing static variables with local or
   dynamically allocated memory.

 - Introduce shared_mutex for annotate data to reduce memory footprint.

 - Make filesystem access library functions more thread safe.

Test:

 - Organize cpu_map tests into a single suite.

 - Add metric value validation test to check if the values are within correct
   value ranges.

 - Add perf stat stdio output test to check if event and metric names match.

 - Add perf data converter JSON output test.

 - Fix a lot of issues reported by shellcheck(1).  This is a preparation to
   enable shellcheck by default.

 - Make the large x86 new instructions test optional at build time using
   EXTRA_TESTS=1.

 - Add a test for libpfm4 events.

perf script:

 - Add 'dsoff' output field to display offset from the DSO.

    $ perf script -F comm,pid,event,ip,dsoff
       ls 2695501 cycles:      152cc73ef4b5 (/usr/lib/x86_64-linux-gnu/ld-2.31.so+0x1c4b5)
       ls 2695501 cycles:  ffffffff99045b3e ([kernel.kallsyms])
       ls 2695501 cycles:  ffffffff9968e107 ([kernel.kallsyms])
       ls 2695501 cycles:  ffffffffc1f54afb ([kernel.kallsyms])
       ls 2695501 cycles:  ffffffff9968382f ([kernel.kallsyms])
       ls 2695501 cycles:  ffffffff99e00094 ([kernel.kallsyms])
       ls 2695501 cycles:      152cc718a8d0 (/usr/lib/x86_64-linux-gnu/libselinux.so.1+0x68d0)
       ls 2695501 cycles:  ffffffff992a6db0 ([kernel.kallsyms])

 - Adjust width for large PID/TID values.

perf report:

 - Make reading addr2line output for srcline more robust by checking for
   sentinel output before the actual data and by using a timeout of 1 second.

 - Allow config terms (like 'name=ABC') with breakpoint events.

    $ perf record -e mem:0x55feb98dd169:x/name=breakpoint/ -p 19646 -- sleep 1

perf annotate:

 - Handle x86 instruction suffix like 'l' in 'movl' generally.

 - Parse instruction operands properly even when they contain whitespace.
   This is needed for llvm-objdump output.

 - Support RISC-V binutils lookup using the triplet prefixes.

 - Add '<' and '>' keys to navigate to prev/next symbols in TUI.

 - Fix instruction association and parsing for LoongArch.

perf stat:

 - Add the --per-cache aggregation option, which optionally takes a cache
   level like `--per-cache=L2`.

    $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
      taskset -c 0-15,64-79,128-143,192-207\
      perf bench sched messaging -p -t -l 100000 -g 8

      # Running 'sched/messaging' benchmark:
      # 20 sender and receiver threads per group
      # 8 groups == 320 threads run

      Total time: 7.648 [sec]

      Performance counter stats for 'system wide':

      S0-D0-L3-ID0             16         17,145,912      ls_dmnd_fills_from_sys.ext_cache_remote
      S0-D0-L3-ID8             16         14,977,628      ls_dmnd_fills_from_sys.ext_cache_remote
      S0-D0-L3-ID16            16            262,539      ls_dmnd_fills_from_sys.ext_cache_remote
      S0-D0-L3-ID24            16              3,140      ls_dmnd_fills_from_sys.ext_cache_remote
      S0-D0-L3-ID32            16             27,403      ls_dmnd_fills_from_sys.ext_cache_remote
      S0-D0-L3-ID40            16             17,026      ls_dmnd_fills_from_sys.ext_cache_remote
      S0-D0-L3-ID48            16              7,292      ls_dmnd_fills_from_sys.ext_cache_remote
      S0-D0-L3-ID56            16              2,464      ls_dmnd_fills_from_sys.ext_cache_remote
      S1-D1-L3-ID64            16         22,489,306      ls_dmnd_fills_from_sys.ext_cache_remote
      S1-D1-L3-ID72            16         21,455,257      ls_dmnd_fills_from_sys.ext_cache_remote
      S1-D1-L3-ID80            16             11,619      ls_dmnd_fills_from_sys.ext_cache_remote
      S1-D1-L3-ID88            16             30,978      ls_dmnd_fills_from_sys.ext_cache_remote
      S1-D1-L3-ID96            16             37,628      ls_dmnd_fills_from_sys.ext_cache_remote
      S1-D1-L3-ID104           16             13,594      ls_dmnd_fills_from_sys.ext_cache_remote
      S1-D1-L3-ID112           16             10,164      ls_dmnd_fills_from_sys.ext_cache_remote
      S1-D1-L3-ID120           16             11,259      ls_dmnd_fills_from_sys.ext_cache_remote

            7.779171484 seconds time elapsed

 - Change default (no event/metric) formatting for default metrics so that
   events are hidden and the metric and group appear.

     Performance counter stats for 'ls /':

                  1.85 msec task-clock                       #    0.594 CPUs utilized
                     0      context-switches                 #    0.000 /sec
                     0      cpu-migrations                   #    0.000 /sec
                    97      page-faults                      #   52.517 K/sec
             2,187,173      cycles                           #    1.184 GHz
             2,474,459      instructions                     #    1.13  insn per cycle
               531,584      branches                         #  287.805 M/sec
                13,626      branch-misses                    #    2.56% of all branches
                            TopdownL1                 #     23.5 %  tma_backend_bound
                                                      #     11.5 %  tma_bad_speculation
                                                      #     39.1 %  tma_frontend_bound
                                                      #     25.9 %  tma_retiring

 - Allow --cputype option to have any PMU name (not just hybrid).

 - Fix output values so that they are not added up across runs when the -r
   option repeats the command multiple times.
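
The --per-cache output shown above can be post-processed with a short
script.  This is a minimal Python sketch, not part of perf: it assumes each
aggregation ID has the form S<socket>-D<die>-L<level>-ID<cache id>, that the
second column is the number of aggregated CPUs and that the third column is
the counter value.  The function name sum_by_socket is hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout of an aggregation ID: S<socket>-D<die>-L<level>-ID<cache id>
AGGR_ID = re.compile(r"S(\d+)-D(\d+)-L(\d+)-ID(\d+)")


def sum_by_socket(output: str) -> dict:
    """Sum the counter values of per-cache lines, grouped by socket."""
    totals = defaultdict(int)
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not an "ID, CPUs, value, event" line
        match = AGGR_ID.fullmatch(fields[0])
        if not match:
            continue  # headers, benchmark output, elapsed time, ...
        socket = int(match.group(1))
        # Counter values use ',' as a thousands separator.
        totals[socket] += int(fields[2].replace(",", ""))
    return dict(totals)


sample = """\
S0-D0-L3-ID0             16         17,145,912      ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID8             16         14,977,628      ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID64            16         22,489,306      ls_dmnd_fills_from_sys.ext_cache_remote
"""
print(sum_by_socket(sample))  # {0: 32123540, 1: 22489306}
```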

perf list:

 - Show metricgroup description from JSON file called metricgroups.json.

 - Allow 'pfm' argument to list only libpfm4 events and check that each
   event is supported before showing it.

JSON vendor events:

 - Avoid event grouping using "NO_GROUP_EVENTS" constraints.  The topdown
   events are correctly grouped even if no group exists.

 - Add a "Default" metric group whose metrics are printed in the default
   output, and use "DefaultMetricgroupName" to indicate the real metric
   group name.

 - Add AmpereOne core PMU events.

Misc:

 - Define man page date correctly.

 - Track exception level properly on ARM CoreSight ETM.

 - Allow anonymous struct, union or enum when retrieving type names from DWARF.

 - Fix incorrect filename when calling `perf inject --jit`.

 - Handle PLT size correctly on LoongArch.
perf test: Skip metrics w/o event name in stat STD output linter

This test checks whether the output of perf stat matches the expected event
names and metrics.  So it wants each output line to have both an event name
and a metric.  Otherwise it should skip the line.

On AMD machines, the instructions event has two metrics and they are printed
on separate lines.  This produces a line without an event name, like below:

  # perf stat -a sleep 1

   Performance counter stats for 'system wide':

           64,383.34 msec cpu-clock                  #   64.048 CPUs utilized
              14,526      context-switches           #  225.617 /sec
                 112      cpu-migrations             #    1.740 /sec
                 190      page-faults                #    2.951 /sec
         807,558,652      cycles                     #    0.013 GHz                         (83.30%)
          69,809,799      stalled-cycles-frontend    #    8.64% frontend cycles idle        (83.30%)
         196,983,266      stalled-cycles-backend     #   24.39% backend cycles idle         (83.30%)
         424,876,008      instructions               #    0.53  insn per cycle
 (here) --->                                  #    0.46  stalled cycles per insn     (83.30%)
          97,788,321      branches                   #    1.519 M/sec                       (83.34%)
           4,147,377      branch-misses              #    4.24% of all branches             (83.46%)

         1.005241409 seconds time elapsed

Modern Intel machines also have TopDown metrics, which don't have event
names either.

  # perf stat -a sleep 1

   Performance counter stats for 'system wide':

            8,015.39 msec cpu-clock                        #    7.996 CPUs utilized
               5,823      context-switches                 #  726.477 /sec
                 189      cpu-migrations                   #   23.580 /sec
                 139      page-faults                      #   17.342 /sec
         435,139,308      cycles                           #    0.054 GHz
         193,891,345      instructions                     #    0.45  insn per cycle
          42,773,028      branches                         #    5.336 M/sec
           2,298,113      branch-misses                    #    5.37% of all branches
                          TopdownL1                 #     25.5 %  tma_backend_bound
              /-->                                  #      7.9 %  tma_bad_speculation
    (here) --+                                      #     55.7 %  tma_frontend_bound
              \-->                                  #     10.9 %  tma_retiring

         1.002395924 seconds time elapsed

There is a check to skip TopdownL1 and TopdownL2 specifically but it
does not cover every affected line.

So there is another check to skip the line if it has nothing on the left
side of the # sign.  That seems OK, but it is not enough either.

When an aggregation mode (like --per-socket or --per-thread) is used, perf
stat adds a prefix (e.g. CPU socket, task name and PID) to each output
line.  So the test code ignores the prefix to normalize the result.

A problem can happen in per-thread mode when the task name contains one or
more spaces.  The test would ignore only the first part of the task name,
conclude that there is something more on the line, and not skip it.

  # perf stat -a --per-thread sleep 1
  ...
            perf-21276                  #     70.2 %  tma_backend_bound
            perf-21276                  #      3.9 %  tma_bad_speculation
            perf-21276                  #     10.5 %  tma_frontend_bound
            perf-21276                  #     15.3 %  tma_retiring
	    ^^^^^^^^^^
	    (ignored)

         my task-21328                  #     70.2 %  tma_backend_bound
         my task-21328                  #      3.9 %  tma_bad_speculation
         my task-21328                  #     10.5 %  tma_frontend_bound
         my task-21328                  #     15.3 %  tma_retiring
	 ^^
     (ignored)
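
The failure can be reproduced with a small sketch.  It is written in Python
for illustration only and is not the test's code; left_side_after_prefix is
a hypothetical helper that drops a single whitespace-separated prefix token,
as described above.

```python
def left_side_after_prefix(line: str) -> str:
    """Return what remains left of '#' after dropping one prefix token.

    Hypothetical helper: models a check that ignores exactly one
    whitespace-separated aggregation prefix (e.g. 'perf-21276').
    """
    left = line.split("#", 1)[0]
    tokens = left.split()
    return " ".join(tokens[1:])  # drop only the first token


ok = "            perf-21276                  #     70.2 %  tma_backend_bound"
bad = "         my task-21328                  #     70.2 %  tma_backend_bound"

# Empty remainder: the line looks "metric only" and is skipped.
print(repr(left_side_after_prefix(ok)))   # ''
# Non-empty remainder: 'task-21328' is mistaken for an event field.
print(repr(left_side_after_prefix(bad)))  # 'task-21328'
```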

So I think it should look at the metric names instead.  Add skip_metric
to hold the list of names to skip.  It would contain 'stalled cycles per
insn' and metrics starting with 'tma_'.
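
The name-based check can be sketched as follows.  This is a Python
illustration under stated assumptions, not the patch itself: SKIP_METRIC
mirrors the two names listed above, should_skip is a hypothetical function,
and matching by substring on the text right of '#' is an assumption.

```python
# Names taken from the description above; substring matching is an assumption.
SKIP_METRIC = ("stalled cycles per insn", "tma_")


def should_skip(line: str) -> bool:
    """True if the metric part (right of '#') names a metric to skip."""
    if "#" not in line:
        return False
    metric = line.split("#", 1)[1]
    return any(name in metric for name in SKIP_METRIC)


lines = [
    "     my task-21328              #     70.2 %  tma_backend_bound",
    "                                #    0.46  stalled cycles per insn     (83.30%)",
    "  424,876,008   instructions    #    0.53  insn per cycle",
]
for line in lines:
    print(should_skip(line))
```

Because the decision depends only on the metric name, spaces in the task
name prefix no longer matter.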

Fixes: 99a04a48f225 ("perf test: Add test case for the standard 'perf stat' output")
Acked-by: Ian Rogers <irogers@google.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230623230139.985594-2-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
1 file changed