| The perfmon hardware monitoring interface |
| ------------------------------------------ |
| Stephane Eranian |
| <eranian@gmail.com> |
| |
| I/ Introduction |
| |
| The perfmon interface provides access to the hardware performance counters |
| of major processors. Nowadays, all processors implement some flavor of |
| performance counters which capture micro-architectural level information |
| such as the number of elapsed cycles, number of cache misses, and so on. |
| |
| The interface is implemented as a set of new system calls and a set of |
| config files in /sys. |
| |
| It is possible to monitor a single thread or a CPU. In either mode, |
| applications can count or sample. System-wide monitoring is supported by |
| running a monitoring session on each CPU. The interface supports event-based |
| sampling where the sampling period is expressed as the number of occurrences |
| of event, instead of just a timeout. This approach provides a better |
| granularity and flexibility. |
| |
| For performance reason, it is possible to use a kernel-level sampling buffer |
| to minimize the overhead incurred by sampling. The format of the buffer, |
| what is recorded, how it is recorded, and how it is exported to user is |
| controlled by a kernel module called a sampling format. The current |
| implementation comes with a default format but it is possible to create |
| additional formats. There is an kernel registration interface for formats. |
| Each format is identified by a simple string which a tool can pass when a |
| monitoring session is created. |
| |
| The interface also provides support for event set and multiplexing to work |
| around hardware limitations in the number of available counters or in how |
| events can be combined. Each set defines as many counters as the hardware |
| can support. The kernel then multiplexes the sets. The interface supports |
| time-based switching but also overflow-based switching, i.e., after n |
| overflows of designated counters. |
| |
| Applications never manipulates the actual performance counter registers. |
| Instead they see a logical Performance Monitoring Unit (PMU) composed of a |
| set of config registers (PMC) and a set of data registers (PMD). Note that |
| PMD are not necessarily counters, they can be buffers. The logical PMU is |
| then mapped onto the actual PMU using a mapping table which is implemented |
| as a kernel module. The mapping is chosen once for each new processor. It is |
| visible in /sys/kernel/perfmon/pmu_desc. The kernel module is automatically |
| loaded on first use. |
| |
| A monitoring session is uniquely identified by a file descriptor obtained |
| when the session is created. File sharing semantics apply to access the |
| session inside a process. A session is never inherited across fork. The file |
| descriptor can be used to receive counter overflow notifications or when the |
| sampling buffer is full. It is possible to use poll/select on the descriptor |
| to wait for notifications from multiple sessions. Similarly, the descriptor |
| supports asynchronous notifications via SIGIO. |
| |
| Counters are always exported as being 64-bit wide regardless of what the |
| underlying hardware implements. |
| |
| II/ Kernel compilation |
| |
| To enable perfmon, you need to enable CONFIG_PERFMON and also some of the |
| model-specific PMU modules. |
| |
| III/ OProfile interactions |
| |
| The set of features offered by perfmon is rich enough to support migrating |
| Oprofile on top of it. That means that PMU programming and low-level |
| interrupt handling could be done by perfmon. The Oprofile sampling buffer |
| management code in the kernel as well as how samples are exported to users |
| could remain through the use of a sampling format. This is how Oprofile |
| works on Itanium. |
| |
| The current interactions with Oprofile are: |
| - on X86: Both subsystems can be compiled into the same kernel. There |
| is enforced mutual exclusion between the two subsystems. When |
| there is an Oprofile session, no perfmon session can exist |
| and vice-versa. |
| |
| - On IA-64: Oprofile works on top of perfmon. Oprofile being a |
| system-wide monitoring tool, the regular per-thread vs. |
| system-wide session restrictions apply. |
| |
| - on PPC: no integration yet. Only one subsystem can be enabled. |
| - on MIPS: no integration yet. Only one subsystem can be enabled. |
| |
| IV/ User tools |
| |
| We have released a simple monitoring tool to demonstrate the features of |
| the interface. The tool is called pfmon and it comes with a simple helper |
| library called libpfm. The library comes with a set of examples to show |
| how to use the kernel interface. Visit http://perfmon2.sf.net for details. |
| |
| There maybe other tools available for perfmon. |
| |
| V/ How to program? |
| |
| The best way to learn how to program perfmon, is to take a look at the |
| source code for the examples in libpfm. The source code is available from: |
| |
| http://perfmon2.sf.net |
| |
| VI/ System calls overview |
| |
| The interface is implemented by the following system calls: |
| |
| * int pfm_create_context(pfarg_ctx_t *ctx, char *fmt, void *arg, size_t arg_size) |
| |
| This function create a perfmon2 context. The type of context is per-thread by |
| default unless PFM_FL_SYSTEM_WIDE is passed in ctx. The sampling format name |
| is passed in fmt. Arguments to the format are passed in arg which is of size |
| arg_size. Upon successful return, the file descriptor identifying the context |
| is returned. |
| |
| * int pfm_write_pmds(int fd, pfarg_pmd_t *pmds, int n) |
| |
| This function is used to program the PMD registers. It is possible to pass |
| vectors of PMDs. |
| |
| * int pfm_write_pmcs(int fd, pfarg_pmc_t *pmds, int n) |
| |
| This function is used to program the PMC registers. It is possible to pass |
| vectors of PMDs. |
| |
| * int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n) |
| |
| This function is used to read the PMD registers. It is possible to pass |
| vectors of PMDs. |
| |
| * int pfm_load_context(int fd, pfarg_load_t *load) |
| |
| This function is used to attach the context to a thread or CPU. |
| Thread means kernel-visible thread (NPTL). The thread identification |
| as obtained by gettid must be passed to load->load_target. |
| |
| To operate on another thread (not self), it is mandatory that the thread |
| be stopped via ptrace(). |
| |
| To attach to a CPU, the CPU number must be specified in load->load_target |
| AND the call must be issued on that CPU. To monitor a CPU, a thread MUST |
| be pinned on that CPU. |
| |
| Until the context is attached, the actual counters are not accessed. |
| |
| * int pfm_unload_context(int fd) |
| |
| The context is detached for the thread or CPU is was attached to. |
| As a consequence monitoring is stopped. |
| |
| When monitoring another thread, the thread MUST be stopped via ptrace() |
| for this function to succeed. |
| |
| * int pfm_start(int fd, pfarg_start_t *st) |
| |
| Start monitoring. The context must be attached for this function to succeed. |
| Optionally, it is possible to specify the event set on which to start using the |
| st argument, otherwise just pass NULL. |
| |
| When monitoring another thread, the thread MUST be stopped via ptrace() |
| for this function to succeed. |
| |
| * int pfm_stop(int fd) |
| |
| Stop monitoring. The context must be attached for this function to succeed. |
| |
| When monitoring another thread, the thread MUST be stopped via ptrace() |
| for this function to succeed. |
| |
| |
| * int pfm_create_evtsets(int fd, pfarg_setdesc_t *sets, int n) |
| |
| This function is used to create or change event sets. By default set 0 exists. |
| It is possible to create/change multiple sets in one call. |
| |
| The context must be detached for this call to succeed. |
| |
| Sets are identified by a 16-bit integer. They are sorted based on this |
| set and switching occurs in a round-robin fashion. |
| |
| * int pfm_delete_evtsets(int fd, pfarg_setdesc_t *sets, int n) |
| |
| Delete event sets. The context must be detached for this call to succeed. |
| |
| |
| * int pfm_getinfo_evtsets(int fd, pfarg_setinfo_t *sets, int n) |
| |
| Retrieve information about event sets. In particular it is possible |
| to get the number of activation of a set. It is possible to retrieve |
| information about multiple sets in one call. |
| |
| |
| * int pfm_restart(int fd) |
| |
| Indicate to the kernel that the application is done processing an overflow |
| notification. A consequence of this call could be that monitoring resumes. |
| |
| * int read(fd, pfm_msg_t *msg, sizeof(pfm_msg_t)) |
| |
| the regular read() system call can be used with the context file descriptor to |
| receive overflow notification messages. Non-blocking read() is supported. |
| |
| Each message carry information about the overflow such as which counter overflowed |
| and where the program was (interrupted instruction pointer). |
| |
| * int close(int fd) |
| |
| To destroy a context, the regular close() system call is used. |
| |
| |
| VII/ /sys interface overview |
| |
| Refer to Documentation/ABI/testing/sysfs-perfmon-* for a detailed description |
| of the sysfs interface of perfmon2. |
| |
| VIII/ debugfs interface overview |
| |
| Refer to Documentation/perfmon2-debugfs.txt for a detailed description of the |
| debug and statistics interface of perfmon2. |
| |
| IX/ Documentation |
| |
| Visit http://perfmon2.sf.net |