The perfmon hardware monitoring interface
------------------------------------------
Stephane Eranian
<eranian@gmail.com>
I/ Introduction
The perfmon interface provides access to the hardware performance counters
of major processors. Nowadays, all processors implement some flavor of
performance counters which capture micro-architectural level information
such as the number of elapsed cycles, number of cache misses, and so on.
The interface is implemented as a set of new system calls and a set of
config files in /sys.
It is possible to monitor a single thread or a CPU. In either mode,
applications can count or sample. System-wide monitoring is supported by
running a monitoring session on each CPU. The interface supports event-based
sampling where the sampling period is expressed as the number of occurrences
of an event, instead of just a timeout. This approach provides finer
granularity and more flexibility.
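
As a rough illustration of event-based sampling, a period of N occurrences
can be armed by loading a counter with the two's complement of N, so that
the counter wraps to zero (overflows) after exactly N increments. The sketch
below simulates this on a plain 64-bit variable. It assumes a 64-bit wide
counter, and the helper names period_to_counter_value() and
events_until_overflow() are hypothetical; they are not part of the perfmon
interface.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: initial value for a 64-bit counter so that it
 * overflows (wraps to 0) after `period` occurrences of the event.
 */
static uint64_t period_to_counter_value(uint64_t period)
{
    assert(period != 0);            /* a zero period is meaningless */
    return (uint64_t)0 - period;    /* two's complement of the period */
}

/*
 * Simulate event occurrences: increment the counter until it wraps
 * to 0 and report how many events that took.
 */
static uint64_t events_until_overflow(uint64_t counter)
{
    uint64_t events = 0;

    do {
        counter++;
        events++;
    } while (counter != 0);

    return events;
}
```

For example, events_until_overflow(period_to_counter_value(100000)) reports
100000 events before the simulated overflow.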
For performance reasons, it is possible to use a kernel-level sampling buffer
to minimize the overhead incurred by sampling. The format of the buffer,
what is recorded, how it is recorded, and how it is exported to user space
are controlled by a kernel module called a sampling format. The current
implementation comes with a default format but it is possible to create
additional formats. There is a kernel registration interface for formats.
Each format is identified by a simple string which a tool can pass when a
monitoring session is created.
The interface also provides support for event sets and multiplexing to work
around hardware limitations in the number of available counters or in how
events can be combined. Each set defines as many counters as the hardware
can support. The kernel then multiplexes the sets. The interface supports
time-based switching but also overflow-based switching, i.e., after n
overflows of designated counters.
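
The overflow-based switching described above can be modelled in user space
as follows. This is only a sketch: struct evt_set, sort_sets() and
on_overflow() are hypothetical names, and the fields do not mirror the
kernel's data structures. Sets are ordered by identifier, and the active set
is replaced by the next one, round-robin, once it has seen its threshold of
overflows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of an event set; not the kernel's structure. */
struct evt_set {
    uint16_t id;            /* set identifier */
    uint32_t switch_thres;  /* switch after this many overflows */
    uint32_t overflows;     /* overflows seen since activation */
    uint64_t activations;   /* number of times the set became active */
};

static int cmp_set_id(const void *a, const void *b)
{
    const struct evt_set *x = a;
    const struct evt_set *y = b;

    return (x->id > y->id) - (x->id < y->id);
}

/* Order the sets by identifier so round-robin follows id order. */
static void sort_sets(struct evt_set *sets, size_t n)
{
    qsort(sets, n, sizeof(*sets), cmp_set_id);
}

/*
 * Record one overflow on the active set. Returns the index of the set
 * that is active afterwards: the same set while it is below its
 * threshold, otherwise the next set (wrapping around to the first).
 */
static size_t on_overflow(struct evt_set *sets, size_t n, size_t active)
{
    assert(n > 0 && active < n);

    if (++sets[active].overflows < sets[active].switch_thres)
        return active;

    sets[active].overflows = 0;
    active = (active + 1) % n;
    sets[active].activations++;
    return active;
}
```

Time-based switching would follow the same round-robin order, with a timer
expiry taking the place of the overflow threshold.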
Applications never manipulate the actual performance counter registers.
Instead they see a logical Performance Monitoring Unit (PMU) composed of a
set of config registers (PMC) and a set of data registers (PMD). Note that
PMDs are not necessarily counters; they can be buffers. The logical PMU is
then mapped onto the actual PMU using a mapping table which is implemented
as a kernel module. The mapping is chosen once for each new processor. It is
visible in /sys/kernel/perfmon/pmu_desc. The kernel module is automatically
loaded on first use.
A monitoring session is uniquely identified by a file descriptor obtained
when the session is created. File sharing semantics apply to accessing the
session inside a process. A session is never inherited across fork. The file
descriptor can be used to receive notifications when a counter overflows or
when the sampling buffer is full. It is possible to use poll/select on the
descriptor to wait for notifications from multiple sessions. Similarly, the
descriptor supports asynchronous notifications via SIGIO.
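
Waiting on several session descriptors at once can be sketched with the
standard poll() call. In the example below, ordinary pipes stand in for
perfmon session descriptors so that the code runs without perfmon support;
wait_for_notification(), demo_two_sessions() and MAX_SESSIONS are
hypothetical names chosen for this illustration.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_SESSIONS 16     /* arbitrary limit for this sketch */

/*
 * Wait until one of the descriptors becomes readable. Returns the index
 * of the first readable descriptor, or -1 on timeout or error.
 */
static int wait_for_notification(const int *fds, size_t n, int timeout_ms)
{
    struct pollfd pfds[MAX_SESSIONS];
    size_t i;

    assert(fds != NULL);
    if (n == 0 || n > MAX_SESSIONS)
        return -1;

    for (i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    if (poll(pfds, (nfds_t)n, timeout_ms) <= 0)
        return -1;

    for (i = 0; i < n; i++)
        if (pfds[i].revents & POLLIN)
            return (int)i;

    return -1;
}

/*
 * Stand-in demonstration: two pipes play the role of two sessions and
 * one byte written to the second pipe plays the role of a notification.
 * Returns the index reported as ready, or -2 if the setup failed.
 */
static int demo_two_sessions(void)
{
    int a[2], b[2], fds[2], ready = -2;
    char msg = 'x';

    if (pipe(a) != 0)
        return -2;
    if (pipe(b) != 0) {
        close(a[0]);
        close(a[1]);
        return -2;
    }

    fds[0] = a[0];
    fds[1] = b[0];
    if (write(b[1], &msg, 1) == 1)
        ready = wait_for_notification(fds, 2, 1000);

    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return ready;
}
```

With a real session, the descriptor reported as ready would then be passed
to read() to fetch the notification message.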
Counters are always exported as being 64-bit wide regardless of what the
underlying hardware implements.
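
One way to present a narrower hardware counter as a 64-bit value is to keep
the upper part in software and fold each hardware wrap-around into it. The
sketch below models that idea in user space; it is an assumption about the
general technique, not a description of the kernel code, and struct
virt_counter, count_events() and read_counter() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a counter narrower than 64 bits. */
struct virt_counter {
    unsigned width;     /* hardware counter width in bits, 1..63 */
    uint64_t hw;        /* value held by the hardware register */
    uint64_t soft;      /* upper part maintained in software */
};

/*
 * Account for `events` occurrences. The low `width` bits stay in the
 * modelled hardware register; every wrap-around is carried into the
 * software part, as an overflow handler would do.
 */
static void count_events(struct virt_counter *c, uint64_t events)
{
    uint64_t mask, total;

    assert(c->width >= 1 && c->width <= 63);
    mask = (UINT64_C(1) << c->width) - 1;
    total = c->soft + c->hw + events;

    c->hw = total & mask;
    c->soft = total & ~mask;
}

/* The 64-bit value an application would see. */
static uint64_t read_counter(const struct virt_counter *c)
{
    return c->soft + c->hw;
}
```

With an 8-bit modelled counter, counting 300 events leaves 44 in the
hardware part and 256 in the software part, and read_counter() returns 300.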
II/ Kernel compilation
To enable perfmon, you need to enable CONFIG_PERFMON and also some of the
model-specific PMU modules.
III/ OProfile interactions
The set of features offered by perfmon is rich enough to support migrating
OProfile on top of it. That means that PMU programming and low-level
interrupt handling could be done by perfmon. The OProfile sampling buffer
management code in the kernel, as well as how samples are exported to users,
could remain through the use of a sampling format. This is how OProfile
works on Itanium.
The current interactions with OProfile are:
- On X86: Both subsystems can be compiled into the same kernel. Mutual
  exclusion between the two subsystems is enforced. When
  there is an OProfile session, no perfmon session can exist
  and vice versa.
- On IA-64: OProfile works on top of perfmon. OProfile being a
  system-wide monitoring tool, the regular per-thread vs.
  system-wide session restrictions apply.
- On PPC: No integration yet. Only one subsystem can be enabled.
- On MIPS: No integration yet. Only one subsystem can be enabled.
IV/ User tools
We have released a simple monitoring tool to demonstrate the features of
the interface. The tool is called pfmon and it comes with a simple helper
library called libpfm. The library comes with a set of examples to show
how to use the kernel interface. Visit http://perfmon2.sf.net for details.
There may be other tools available for perfmon.
V/ How to program?
The best way to learn how to program perfmon is to take a look at the
source code for the examples in libpfm. The source code is available from:
http://perfmon2.sf.net
VI/ System calls overview
The interface is implemented by the following system calls:
* int pfm_create_context(pfarg_ctx_t *ctx, char *fmt, void *arg, size_t arg_size)
This function creates a perfmon2 context. The type of context is per-thread by
default unless PFM_FL_SYSTEM_WIDE is passed in ctx. The sampling format name
is passed in fmt. Arguments to the format are passed in arg which is of size
arg_size. On success, the call returns the file descriptor identifying the
context.
* int pfm_write_pmds(int fd, pfarg_pmd_t *pmds, int n)
This function is used to program the PMD registers. It is possible to pass
vectors of PMDs.
* int pfm_write_pmcs(int fd, pfarg_pmc_t *pmcs, int n)
This function is used to program the PMC registers. It is possible to pass
vectors of PMCs.
* int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)
This function is used to read the PMD registers. It is possible to pass
vectors of PMDs.
* int pfm_load_context(int fd, pfarg_load_t *load)
This function is used to attach the context to a thread or CPU.
Thread means kernel-visible thread (NPTL). The thread ID, as obtained by
gettid, must be passed in load->load_target.
To operate on another thread (not self), it is mandatory that the thread
be stopped via ptrace().
To attach to a CPU, the CPU number must be specified in load->load_target
AND the call must be issued on that CPU. To monitor a CPU, a thread MUST
be pinned on that CPU.
Until the context is attached, the actual counters are not accessed.
* int pfm_unload_context(int fd)
The context is detached from the thread or CPU it was attached to.
As a consequence, monitoring is stopped.
When monitoring another thread, the thread MUST be stopped via ptrace()
for this function to succeed.
* int pfm_start(int fd, pfarg_start_t *st)
Start monitoring. The context must be attached for this function to succeed.
Optionally, it is possible to specify the event set on which to start using the
st argument, otherwise just pass NULL.
When monitoring another thread, the thread MUST be stopped via ptrace()
for this function to succeed.
* int pfm_stop(int fd)
Stop monitoring. The context must be attached for this function to succeed.
When monitoring another thread, the thread MUST be stopped via ptrace()
for this function to succeed.
* int pfm_create_evtsets(int fd, pfarg_setdesc_t *sets, int n)
This function is used to create or change event sets. By default set 0 exists.
It is possible to create/change multiple sets in one call.
The context must be detached for this call to succeed.
Sets are identified by a 16-bit integer. They are sorted based on this
identifier and switching occurs in a round-robin fashion.
* int pfm_delete_evtsets(int fd, pfarg_setdesc_t *sets, int n)
Delete event sets. The context must be detached for this call to succeed.
* int pfm_getinfo_evtsets(int fd, pfarg_setinfo_t *sets, int n)
Retrieve information about event sets. In particular, it is possible
to get the number of activations of a set. It is possible to retrieve
information about multiple sets in one call.
* int pfm_restart(int fd)
Indicate to the kernel that the application is done processing an overflow
notification. A consequence of this call could be that monitoring resumes.
* int read(fd, pfm_msg_t *msg, sizeof(pfm_msg_t))
The regular read() system call can be used with the context file descriptor to
receive overflow notification messages. Non-blocking read() is supported.
Each message carries information about the overflow such as which counter
overflowed and where the program was (interrupted instruction pointer).
* int close(int fd)
To destroy a context, the regular close() system call is used.
VII/ /sys interface overview
Refer to Documentation/ABI/testing/sysfs-perfmon-* for a detailed description
of the sysfs interface of perfmon2.
VIII/ debugfs interface overview
Refer to Documentation/perfmon2-debugfs.txt for a detailed description of the
debug and statistics interface of perfmon2.
IX/ Documentation
Visit http://perfmon2.sf.net