| ========================================= | 
 | user_events: User-based Event Tracing | 
 | ========================================= | 
 |  | 
 | :Author: Beau Belgrave | 
 |  | 
 | Overview | 
 | -------- | 
 | User based trace events allow user processes to create events and trace data | 
 | that can be viewed via existing tools, such as ftrace and perf. | 
 | To enable this feature, build your kernel with CONFIG_USER_EVENTS=y. | 
 |  | 
 | Programs can view status of the events via | 
 | /sys/kernel/tracing/user_events_status and can both register and write | 
 | data out via /sys/kernel/tracing/user_events_data. | 
 |  | 
 | Programs can also use /sys/kernel/tracing/dynamic_events to register and | 
 | delete user based events via the u: prefix. The format of the command to | 
 | dynamic_events is the same as the ioctl with the u: prefix applied. This | 
 | requires CAP_PERFMON due to the event persisting, otherwise -EPERM is returned. | 
 |  | 
 | Typically programs will register a set of events that they wish to expose to | 
 | tools that can read trace_events (such as ftrace and perf). The registration | 
 | process tells the kernel which address and bit to reflect if any tool has | 
 | enabled the event and data should be written. The registration will give back | 
 | a write index which describes the data when a write() or writev() is called | 
 | on the /sys/kernel/tracing/user_events_data file. | 
 |  | 
 | The structures referenced in this document are contained within the | 
 | /include/uapi/linux/user_events.h file in the source tree. | 
 |  | 
 | **NOTE:** *Both user_events_status and user_events_data are under the tracefs | 
 | filesystem and may be mounted at different paths than above.* | 
 |  | 
 | Registering | 
 | ----------- | 
 | Registering within a user process is done via ioctl() out to the | 
 | /sys/kernel/tracing/user_events_data file. The command to issue is | 
 | DIAG_IOCSREG. | 
 |  | 
 | This command takes a packed struct user_reg as an argument:: | 
 |  | 
 |   struct user_reg { | 
 |         /* Input: Size of the user_reg structure being used */ | 
 |         __u32 size; | 
 |  | 
 |         /* Input: Bit in enable address to use */ | 
 |         __u8 enable_bit; | 
 |  | 
 |         /* Input: Enable size in bytes at address */ | 
 |         __u8 enable_size; | 
 |  | 
 |         /* Input: Flags to use, if any */ | 
 |         __u16 flags; | 
 |  | 
 |         /* Input: Address to update when enabled */ | 
 |         __u64 enable_addr; | 
 |  | 
 |         /* Input: Pointer to string with event name, description and flags */ | 
 |         __u64 name_args; | 
 |  | 
 |         /* Output: Index of the event to use when writing data */ | 
 |         __u32 write_index; | 
 |   } __attribute__((__packed__)); | 
 |  | 
 | The struct user_reg requires all the above inputs to be set appropriately. | 
 |  | 
 | + size: This must be set to sizeof(struct user_reg). | 
 |  | 
 | + enable_bit: The bit to reflect the event status at the address specified by | 
 |   enable_addr. | 
 |  | 
 | + enable_size: The size of the value specified by enable_addr. | 
 |   This must be 4 (32-bit) or 8 (64-bit). 64-bit values are only allowed to be | 
 |   used on 64-bit kernels, however, 32-bit can be used on all kernels. | 
 |  | 
 | + flags: The flags to use, if any. | 
 |   Callers should first attempt to use flags and retry without flags to ensure | 
 |   support for lower versions of the kernel. If a flag is not supported -EINVAL | 
 |   is returned. | 
 |  | 
 | + enable_addr: The address of the value to use to reflect event status. This | 
 |   must be naturally aligned and write accessible within the user program. | 
 |  | 
 | + name_args: The name and arguments to describe the event, see command format | 
 |   for details. | 
 |  | 
 | The following flags are currently supported. | 
 |  | 
 | + USER_EVENT_REG_PERSIST: The event will not delete upon the last reference | 
 |   closing. Callers may use this if an event should exist even after the | 
 |   process closes or unregisters the event. Requires CAP_PERFMON otherwise | 
 |   -EPERM is returned. | 
 |  | 
 | Upon successful registration the following is set. | 
 |  | 
 | + write_index: The index to use for this file descriptor that represents this | 
 |   event when writing out data. The index is unique to this instance of the file | 
 |   descriptor that was used for the registration. See writing data for details. | 
 |  | 
 | User based events show up under tracefs like any other event under the | 
 | subsystem named "user_events". This means tools that wish to attach to the | 
 | events need to use /sys/kernel/tracing/events/user_events/[name]/enable | 
 | or perf record -e user_events:[name] when attaching/recording. | 
 |  | 
 | **NOTE:** The event subsystem name by default is "user_events". Callers should | 
 | not assume it will always be "user_events". Operators reserve the right in the | 
 | future to change the subsystem name per-process to accommodate event isolation. | 
 |  | 
 | Command Format | 
 | ^^^^^^^^^^^^^^ | 
 | The command string format is as follows:: | 
 |  | 
 |   name[:FLAG1[,FLAG2...]] [Field1[;Field2...]] | 
 |  | 
 | Supported Flags | 
 | ^^^^^^^^^^^^^^^ | 
 | None yet | 
 |  | 
 | Field Format | 
 | ^^^^^^^^^^^^ | 
 | :: | 
 |  | 
 |   type name [size] | 
 |  | 
 | Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc). | 
 | User programs are encouraged to use clearly sized types like u32. | 
 |  | 
 | **NOTE:** *Long is not supported since size can vary between user and kernel.* | 
 |  | 
 | The size is only valid for types that start with a struct prefix. | 
 | This allows user programs to describe custom structs out to tools, if required. | 
 |  | 
 | For example, a struct in C that looks like this:: | 
 |  | 
 |   struct mytype { | 
 |     char data[20]; | 
 |   }; | 
 |  | 
 | Would be represented by the following field:: | 
 |  | 
 |   struct mytype myname 20 | 
 |  | 
 | Deleting | 
 | -------- | 
 | Deleting an event from within a user process is done via ioctl() out to the | 
 | /sys/kernel/tracing/user_events_data file. The command to issue is | 
 | DIAG_IOCSDEL. | 
 |  | 
 | This command only requires a single string specifying the event to delete by | 
 | its name. Delete will only succeed if there are no references left to the | 
 | event (in both user and kernel space). User programs should use a separate file | 
 | to request deletes than the one used for registration due to this. | 
 |  | 
 | **NOTE:** By default events will auto-delete when there are no references left | 
 | to the event. If programs do not want auto-delete, they must use the | 
 | USER_EVENT_REG_PERSIST flag when registering the event. Once that flag is used | 
 | the event exists until DIAG_IOCSDEL is invoked. Both register and delete of an | 
 | event that persists requires CAP_PERFMON, otherwise -EPERM is returned. | 
 |  | 
 | Unregistering | 
 | ------------- | 
 | If after registering an event it is no longer wanted to be updated then it can | 
 | be disabled via ioctl() out to the /sys/kernel/tracing/user_events_data file. | 
 | The command to issue is DIAG_IOCSUNREG. This is different than deleting, where | 
 | deleting actually removes the event from the system. Unregistering simply tells | 
 | the kernel your process is no longer interested in updates to the event. | 
 |  | 
 | This command takes a packed struct user_unreg as an argument:: | 
 |  | 
 |   struct user_unreg { | 
 |         /* Input: Size of the user_unreg structure being used */ | 
 |         __u32 size; | 
 |  | 
 |         /* Input: Bit to unregister */ | 
 |         __u8 disable_bit; | 
 |  | 
 |         /* Input: Reserved, set to 0 */ | 
 |         __u8 __reserved; | 
 |  | 
 |         /* Input: Reserved, set to 0 */ | 
 |         __u16 __reserved2; | 
 |  | 
 |         /* Input: Address to unregister */ | 
 |         __u64 disable_addr; | 
 |   } __attribute__((__packed__)); | 
 |  | 
 | The struct user_unreg requires all the above inputs to be set appropriately. | 
 |  | 
 | + size: This must be set to sizeof(struct user_unreg). | 
 |  | 
 | + disable_bit: This must be set to the bit to disable (same bit that was | 
 |   previously registered via enable_bit). | 
 |  | 
 | + disable_addr: This must be set to the address to disable (same address that was | 
 |   previously registered via enable_addr). | 
 |  | 
 | **NOTE:** Events are automatically unregistered when execve() is invoked. During | 
 | fork() the registered events will be retained and must be unregistered manually | 
 | in each process if wanted. | 
 |  | 
 | Status | 
 | ------ | 
 | When tools attach/record user based events the status of the event is updated | 
 | in realtime. This allows user programs to only incur the cost of the write() or | 
 | writev() calls when something is actively attached to the event. | 
 |  | 
 | The kernel will update the specified bit that was registered for the event as | 
 | tools attach/detach from the event. User programs simply check if the bit is set | 
 | to see if something is attached or not. | 
 |  | 
 | Administrators can easily check the status of all registered events by reading | 
 | the user_events_status file directly via a terminal. The output is as follows:: | 
 |  | 
 |   Name [# Comments] | 
 |   ... | 
 |  | 
 |   Active: ActiveCount | 
 |   Busy: BusyCount | 
 |  | 
 | For example, on a system that has a single event the output looks like this:: | 
 |  | 
 |   test | 
 |  | 
 |   Active: 1 | 
 |   Busy: 0 | 
 |  | 
 | If a user enables the user event via ftrace, the output would change to this:: | 
 |  | 
 |   test # Used by ftrace | 
 |  | 
 |   Active: 1 | 
 |   Busy: 1 | 
 |  | 
 | Writing Data | 
 | ------------ | 
 | After registering an event the same fd that was used to register can be used | 
 | to write an entry for that event. The write_index returned must be at the start | 
 | of the data, then the remaining data is treated as the payload of the event. | 
 |  | 
 | For example, if write_index returned was 1 and I wanted to write out an int | 
 | payload of the event. Then the data would have to be 8 bytes (2 ints) in size, | 
 | with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the | 
 | value I want as the payload. | 
 |  | 
 | In memory this would look like this:: | 
 |  | 
 |   int index; | 
 |   int payload; | 
 |  | 
 | User programs might have well known structs that they wish to use to emit out | 
 | as payloads. In those cases writev() can be used, with the first vector being | 
 | the index and the following vector(s) being the actual event payload. | 
 |  | 
 | For example, if I have a struct like this:: | 
 |  | 
 |   struct payload { | 
 |         int src; | 
 |         int dst; | 
 |         int flags; | 
 |   } __attribute__((__packed__)); | 
 |  | 
 | It's advised for user programs to do the following:: | 
 |  | 
 |   struct iovec io[2]; | 
 |   struct payload e; | 
 |  | 
 |   io[0].iov_base = &write_index; | 
 |   io[0].iov_len = sizeof(write_index); | 
 |   io[1].iov_base = &e; | 
 |   io[1].iov_len = sizeof(e); | 
 |  | 
 |   writev(fd, (const struct iovec*)io, 2); | 
 |  | 
 | **NOTE:** *The write_index is not emitted out into the trace being recorded.* | 
 |  | 
 | Example Code | 
 | ------------ | 
 | See sample code in samples/user_events. |