| ============= | 
 | Event Tracing | 
 | ============= | 
 |  | 
 | :Author: Theodore Ts'o | 
 | :Updated: Li Zefan and Tom Zanussi | 
 |  | 
 | 1. Introduction | 
 | =============== | 
 |  | 
 | Tracepoints (see Documentation/trace/tracepoints.rst) can be used | 
 | without creating custom kernel modules to register probe functions | 
 | using the event tracing infrastructure. | 
 |  | 
 | Not all tracepoints can be traced using the event tracing system; | 
 | the kernel developer must provide code snippets which define how the | 
 | tracing information is saved into the tracing buffer, and how the | 
 | tracing information should be printed. | 
 |  | 
 | 2. Using Event Tracing | 
 | ====================== | 
 |  | 
 | 2.1 Via the 'set_event' interface | 
 | --------------------------------- | 
 |  | 
 | The events which are available for tracing can be found in the file | 
 | /sys/kernel/debug/tracing/available_events. | 
 |  | 
 | To enable a particular event, such as 'sched_wakeup', simply echo it | 
 | to /sys/kernel/debug/tracing/set_event. For example:: | 
 |  | 
 | 	# echo sched_wakeup >> /sys/kernel/debug/tracing/set_event | 
 |  | 
 | .. Note:: '>>' is necessary, otherwise it will firstly disable all the events. | 
 |  | 
 | To disable an event, echo the event name to the set_event file prefixed | 
 | with an exclamation point:: | 
 |  | 
 | 	# echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event | 
 |  | 
 | To disable all events, echo an empty line to the set_event file:: | 
 |  | 
 | 	# echo > /sys/kernel/debug/tracing/set_event | 
 |  | 
 | To enable all events, echo ``*:*`` or ``*:`` to the set_event file:: | 
 |  | 
 | 	# echo *:* > /sys/kernel/debug/tracing/set_event | 
 |  | 
 | The events are organized into subsystems, such as ext4, irq, sched, | 
 | etc., and a full event name looks like this: <subsystem>:<event>.  The | 
 | subsystem name is optional, but it is displayed in the available_events | 
 | file.  All of the events in a subsystem can be specified via the syntax | 
 | ``<subsystem>:*``; for example, to enable all irq events, you can use the | 
 | command:: | 
 |  | 
 | 	# echo 'irq:*' > /sys/kernel/debug/tracing/set_event | 
 |  | 
 | 2.2 Via the 'enable' toggle | 
 | --------------------------- | 
 |  | 
 | The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy | 
 | of directories. | 
 |  | 
 | To enable event 'sched_wakeup':: | 
 |  | 
 | 	# echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable | 
 |  | 
 | To disable it:: | 
 |  | 
 | 	# echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable | 
 |  | 
 | To enable all events in sched subsystem:: | 
 |  | 
 | 	# echo 1 > /sys/kernel/debug/tracing/events/sched/enable | 
 |  | 
 | To enable all events:: | 
 |  | 
 | 	# echo 1 > /sys/kernel/debug/tracing/events/enable | 
 |  | 
 | When reading one of these enable files, there are four results: | 
 |  | 
 |  - 0 - all events this file affects are disabled | 
 |  - 1 - all events this file affects are enabled | 
 |  - X - there is a mixture of events enabled and disabled | 
 |  - ? - this file does not affect any event | 
 |  | 
 | 2.3 Boot option | 
 | --------------- | 
 |  | 
 | In order to facilitate early boot debugging, use boot option:: | 
 |  | 
 | 	trace_event=[event-list] | 
 |  | 
 | event-list is a comma separated list of events. See section 2.1 for event | 
 | format. | 
 |  | 
 | 3. Defining an event-enabled tracepoint | 
 | ======================================= | 
 |  | 
 | See The example provided in samples/trace_events | 
 |  | 
 | 4. Event formats | 
 | ================ | 
 |  | 
 | Each trace event has a 'format' file associated with it that contains | 
 | a description of each field in a logged event.  This information can | 
 | be used to parse the binary trace stream, and is also the place to | 
 | find the field names that can be used in event filters (see section 5). | 
 |  | 
 | It also displays the format string that will be used to print the | 
 | event in text mode, along with the event name and ID used for | 
 | profiling. | 
 |  | 
 | Every event has a set of ``common`` fields associated with it; these are | 
 | the fields prefixed with ``common_``.  The other fields vary between | 
 | events and correspond to the fields defined in the TRACE_EVENT | 
 | definition for that event. | 
 |  | 
 | Each field in the format has the form:: | 
 |  | 
 |      field:field-type field-name; offset:N; size:N; | 
 |  | 
 | where offset is the offset of the field in the trace record and size | 
 | is the size of the data item, in bytes. | 
 |  | 
 | For example, here's the information displayed for the 'sched_wakeup' | 
 | event:: | 
 |  | 
 | 	# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format | 
 |  | 
 | 	name: sched_wakeup | 
 | 	ID: 60 | 
 | 	format: | 
 | 		field:unsigned short common_type;	offset:0;	size:2; | 
 | 		field:unsigned char common_flags;	offset:2;	size:1; | 
 | 		field:unsigned char common_preempt_count;	offset:3;	size:1; | 
 | 		field:int common_pid;	offset:4;	size:4; | 
 | 		field:int common_tgid;	offset:8;	size:4; | 
 |  | 
 | 		field:char comm[TASK_COMM_LEN];	offset:12;	size:16; | 
 | 		field:pid_t pid;	offset:28;	size:4; | 
 | 		field:int prio;	offset:32;	size:4; | 
 | 		field:int success;	offset:36;	size:4; | 
 | 		field:int cpu;	offset:40;	size:4; | 
 |  | 
 | 	print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid, | 
 | 		   REC->prio, REC->success, REC->cpu | 
 |  | 
 | This event contains 10 fields, the first 5 common and the remaining 5 | 
 | event-specific.  All the fields for this event are numeric, except for | 
 | 'comm' which is a string, a distinction important for event filtering. | 
 |  | 
 | 5. Event filtering | 
 | ================== | 
 |  | 
 | Trace events can be filtered in the kernel by associating boolean | 
 | 'filter expressions' with them.  As soon as an event is logged into | 
 | the trace buffer, its fields are checked against the filter expression | 
 | associated with that event type.  An event with field values that | 
 | 'match' the filter will appear in the trace output, and an event whose | 
 | values don't match will be discarded.  An event with no filter | 
 | associated with it matches everything, and is the default when no | 
 | filter has been set for an event. | 
 |  | 
 | 5.1 Expression syntax | 
 | --------------------- | 
 |  | 
 | A filter expression consists of one or more 'predicates' that can be | 
 | combined using the logical operators '&&' and '||'.  A predicate is | 
 | simply a clause that compares the value of a field contained within a | 
 | logged event with a constant value and returns either 0 or 1 depending | 
 | on whether the field value matched (1) or didn't match (0):: | 
 |  | 
 | 	  field-name relational-operator value | 
 |  | 
 | Parentheses can be used to provide arbitrary logical groupings and | 
 | double-quotes can be used to prevent the shell from interpreting | 
 | operators as shell metacharacters. | 
 |  | 
 | The field-names available for use in filters can be found in the | 
 | 'format' files for trace events (see section 4). | 
 |  | 
 | The relational-operators depend on the type of the field being tested: | 
 |  | 
 | The operators available for numeric fields are: | 
 |  | 
 | ==, !=, <, <=, >, >=, & | 
 |  | 
 | And for string fields they are: | 
 |  | 
 | ==, !=, ~ | 
 |  | 
 | The glob (~) accepts a wild card character (\*,?) and character classes | 
 | ([). For example:: | 
 |  | 
 |   prev_comm ~ "*sh" | 
 |   prev_comm ~ "sh*" | 
 |   prev_comm ~ "*sh*" | 
 |   prev_comm ~ "ba*sh" | 
 |  | 
 | 5.2 Setting filters | 
 | ------------------- | 
 |  | 
 | A filter for an individual event is set by writing a filter expression | 
 | to the 'filter' file for the given event. | 
 |  | 
 | For example:: | 
 |  | 
 | 	# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup | 
 | 	# echo "common_preempt_count > 4" > filter | 
 |  | 
 | A slightly more involved example:: | 
 |  | 
 | 	# cd /sys/kernel/debug/tracing/events/signal/signal_generate | 
 | 	# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter | 
 |  | 
 | If there is an error in the expression, you'll get an 'Invalid | 
 | argument' error when setting it, and the erroneous string along with | 
 | an error message can be seen by looking at the filter e.g.:: | 
 |  | 
 | 	# cd /sys/kernel/debug/tracing/events/signal/signal_generate | 
 | 	# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter | 
 | 	-bash: echo: write error: Invalid argument | 
 | 	# cat filter | 
 | 	((sig >= 10 && sig < 15) || dsig == 17) && comm != bash | 
 | 	^ | 
 | 	parse_error: Field not found | 
 |  | 
 | Currently the caret ('^') for an error always appears at the beginning of | 
 | the filter string; the error message should still be useful though | 
 | even without more accurate position info. | 
 |  | 
 | 5.3 Clearing filters | 
 | -------------------- | 
 |  | 
 | To clear the filter for an event, write a '0' to the event's filter | 
 | file. | 
 |  | 
 | To clear the filters for all events in a subsystem, write a '0' to the | 
 | subsystem's filter file. | 
 |  | 
 | 5.3 Subsystem filters | 
 | --------------------- | 
 |  | 
 | For convenience, filters for every event in a subsystem can be set or | 
 | cleared as a group by writing a filter expression into the filter file | 
 | at the root of the subsystem.  Note however, that if a filter for any | 
 | event within the subsystem lacks a field specified in the subsystem | 
 | filter, or if the filter can't be applied for any other reason, the | 
 | filter for that event will retain its previous setting.  This can | 
 | result in an unintended mixture of filters which could lead to | 
 | confusing (to the user who might think different filters are in | 
 | effect) trace output.  Only filters that reference just the common | 
 | fields can be guaranteed to propagate successfully to all events. | 
 |  | 
 | Here are a few subsystem filter examples that also illustrate the | 
 | above points: | 
 |  | 
 | Clear the filters on all events in the sched subsystem:: | 
 |  | 
 | 	# cd /sys/kernel/debug/tracing/events/sched | 
 | 	# echo 0 > filter | 
 | 	# cat sched_switch/filter | 
 | 	none | 
 | 	# cat sched_wakeup/filter | 
 | 	none | 
 |  | 
 | Set a filter using only common fields for all events in the sched | 
 | subsystem (all events end up with the same filter):: | 
 |  | 
 | 	# cd /sys/kernel/debug/tracing/events/sched | 
 | 	# echo common_pid == 0 > filter | 
 | 	# cat sched_switch/filter | 
 | 	common_pid == 0 | 
 | 	# cat sched_wakeup/filter | 
 | 	common_pid == 0 | 
 |  | 
 | Attempt to set a filter using a non-common field for all events in the | 
 | sched subsystem (all events but those that have a prev_pid field retain | 
 | their old filters):: | 
 |  | 
 | 	# cd /sys/kernel/debug/tracing/events/sched | 
 | 	# echo prev_pid == 0 > filter | 
 | 	# cat sched_switch/filter | 
 | 	prev_pid == 0 | 
 | 	# cat sched_wakeup/filter | 
 | 	common_pid == 0 | 
 |  | 
 | 5.4 PID filtering | 
 | ----------------- | 
 |  | 
 | The set_event_pid file in the same directory as the top events directory | 
 | exists, will filter all events from tracing any task that does not have the | 
 | PID listed in the set_event_pid file. | 
 | :: | 
 |  | 
 | 	# cd /sys/kernel/debug/tracing | 
 | 	# echo $$ > set_event_pid | 
 | 	# echo 1 > events/enable | 
 |  | 
 | Will only trace events for the current task. | 
 |  | 
 | To add more PIDs without losing the PIDs already included, use '>>'. | 
 | :: | 
 |  | 
 | 	# echo 123 244 1 >> set_event_pid | 
 |  | 
 |  | 
 | 6. Event triggers | 
 | ================= | 
 |  | 
 | Trace events can be made to conditionally invoke trigger 'commands' | 
 | which can take various forms and are described in detail below; | 
 | examples would be enabling or disabling other trace events or invoking | 
 | a stack trace whenever the trace event is hit.  Whenever a trace event | 
 | with attached triggers is invoked, the set of trigger commands | 
 | associated with that event is invoked.  Any given trigger can | 
 | additionally have an event filter of the same form as described in | 
 | section 5 (Event filtering) associated with it - the command will only | 
 | be invoked if the event being invoked passes the associated filter. | 
 | If no filter is associated with the trigger, it always passes. | 
 |  | 
 | Triggers are added to and removed from a particular event by writing | 
 | trigger expressions to the 'trigger' file for the given event. | 
 |  | 
 | A given event can have any number of triggers associated with it, | 
 | subject to any restrictions that individual commands may have in that | 
 | regard. | 
 |  | 
 | Event triggers are implemented on top of "soft" mode, which means that | 
 | whenever a trace event has one or more triggers associated with it, | 
 | the event is activated even if it isn't actually enabled, but is | 
 | disabled in a "soft" mode.  That is, the tracepoint will be called, | 
 | but just will not be traced, unless of course it's actually enabled. | 
 | This scheme allows triggers to be invoked even for events that aren't | 
 | enabled, and also allows the current event filter implementation to be | 
 | used for conditionally invoking triggers. | 
 |  | 
 | The syntax for event triggers is roughly based on the syntax for | 
 | set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands' | 
 | section of Documentation/trace/ftrace.rst), but there are major | 
 | differences and the implementation isn't currently tied to it in any | 
 | way, so beware about making generalizations between the two. | 
 |  | 
 | Note: Writing into trace_marker (See Documentation/trace/ftrace.rst) | 
 |      can also enable triggers that are written into | 
 |      /sys/kernel/tracing/events/ftrace/print/trigger | 
 |  | 
 | 6.1 Expression syntax | 
 | --------------------- | 
 |  | 
 | Triggers are added by echoing the command to the 'trigger' file:: | 
 |  | 
 |   # echo 'command[:count] [if filter]' > trigger | 
 |  | 
 | Triggers are removed by echoing the same command but starting with '!' | 
 | to the 'trigger' file:: | 
 |  | 
 |   # echo '!command[:count] [if filter]' > trigger | 
 |  | 
 | The [if filter] part isn't used in matching commands when removing, so | 
 | leaving that off in a '!' command will accomplish the same thing as | 
 | having it in. | 
 |  | 
 | The filter syntax is the same as that described in the 'Event | 
 | filtering' section above. | 
 |  | 
 | For ease of use, writing to the trigger file using '>' currently just | 
 | adds or removes a single trigger and there's no explicit '>>' support | 
 | ('>' actually behaves like '>>') or truncation support to remove all | 
 | triggers (you have to use '!' for each one added.) | 
 |  | 
 | 6.2 Supported trigger commands | 
 | ------------------------------ | 
 |  | 
 | The following commands are supported: | 
 |  | 
 | - enable_event/disable_event | 
 |  | 
 |   These commands can enable or disable another trace event whenever | 
 |   the triggering event is hit.  When these commands are registered, | 
 |   the other trace event is activated, but disabled in a "soft" mode. | 
 |   That is, the tracepoint will be called, but just will not be traced. | 
 |   The event tracepoint stays in this mode as long as there's a trigger | 
 |   in effect that can trigger it. | 
 |  | 
 |   For example, the following trigger causes kmalloc events to be | 
 |   traced when a read system call is entered, and the :1 at the end | 
 |   specifies that this enablement happens only once:: | 
 |  | 
 | 	  # echo 'enable_event:kmem:kmalloc:1' > \ | 
 | 	      /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger | 
 |  | 
 |   The following trigger causes kmalloc events to stop being traced | 
 |   when a read system call exits.  This disablement happens on every | 
 |   read system call exit:: | 
 |  | 
 | 	  # echo 'disable_event:kmem:kmalloc' > \ | 
 | 	      /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger | 
 |  | 
 |   The format is:: | 
 |  | 
 |       enable_event:<system>:<event>[:count] | 
 |       disable_event:<system>:<event>[:count] | 
 |  | 
 |   To remove the above commands:: | 
 |  | 
 | 	  # echo '!enable_event:kmem:kmalloc:1' > \ | 
 | 	      /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger | 
 |  | 
 | 	  # echo '!disable_event:kmem:kmalloc' > \ | 
 | 	      /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger | 
 |  | 
 |   Note that there can be any number of enable/disable_event triggers | 
 |   per triggering event, but there can only be one trigger per | 
 |   triggered event. e.g. sys_enter_read can have triggers enabling both | 
 |   kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc | 
 |   versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if | 
 |   bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they | 
 |   could be combined into a single filter on kmem:kmalloc though). | 
 |  | 
 | - stacktrace | 
 |  | 
 |   This command dumps a stacktrace in the trace buffer whenever the | 
 |   triggering event occurs. | 
 |  | 
 |   For example, the following trigger dumps a stacktrace every time the | 
 |   kmalloc tracepoint is hit:: | 
 |  | 
 | 	  # echo 'stacktrace' > \ | 
 | 		/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | 
 |  | 
 |   The following trigger dumps a stacktrace the first 5 times a kmalloc | 
 |   request happens with a size >= 64K:: | 
 |  | 
 | 	  # echo 'stacktrace:5 if bytes_req >= 65536' > \ | 
 | 		/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | 
 |  | 
 |   The format is:: | 
 |  | 
 |       stacktrace[:count] | 
 |  | 
 |   To remove the above commands:: | 
 |  | 
 | 	  # echo '!stacktrace' > \ | 
 | 		/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | 
 |  | 
 | 	  # echo '!stacktrace:5 if bytes_req >= 65536' > \ | 
 | 		/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | 
 |  | 
 |   The latter can also be removed more simply by the following (without | 
 |   the filter):: | 
 |  | 
 | 	  # echo '!stacktrace:5' > \ | 
 | 		/sys/kernel/debug/tracing/events/kmem/kmalloc/trigger | 
 |  | 
 |   Note that there can be only one stacktrace trigger per triggering | 
 |   event. | 
 |  | 
 | - snapshot | 
 |  | 
 |   This command causes a snapshot to be triggered whenever the | 
 |   triggering event occurs. | 
 |  | 
 |   The following command creates a snapshot every time a block request | 
 |   queue is unplugged with a depth > 1.  If you were tracing a set of | 
 |   events or functions at the time, the snapshot trace buffer would | 
 |   capture those events when the trigger event occurred:: | 
 |  | 
 | 	  # echo 'snapshot if nr_rq > 1' > \ | 
 | 		/sys/kernel/debug/tracing/events/block/block_unplug/trigger | 
 |  | 
 |   To only snapshot once:: | 
 |  | 
 | 	  # echo 'snapshot:1 if nr_rq > 1' > \ | 
 | 		/sys/kernel/debug/tracing/events/block/block_unplug/trigger | 
 |  | 
 |   To remove the above commands:: | 
 |  | 
 | 	  # echo '!snapshot if nr_rq > 1' > \ | 
 | 		/sys/kernel/debug/tracing/events/block/block_unplug/trigger | 
 |  | 
 | 	  # echo '!snapshot:1 if nr_rq > 1' > \ | 
 | 		/sys/kernel/debug/tracing/events/block/block_unplug/trigger | 
 |  | 
 |   Note that there can be only one snapshot trigger per triggering | 
 |   event. | 
 |  | 
 | - traceon/traceoff | 
 |  | 
 |   These commands turn tracing on and off when the specified events are | 
 |   hit. The parameter determines how many times the tracing system is | 
 |   turned on and off. If unspecified, there is no limit. | 
 |  | 
 |   The following command turns tracing off the first time a block | 
 |   request queue is unplugged with a depth > 1.  If you were tracing a | 
 |   set of events or functions at the time, you could then examine the | 
 |   trace buffer to see the sequence of events that led up to the | 
 |   trigger event:: | 
 |  | 
 | 	  # echo 'traceoff:1 if nr_rq > 1' > \ | 
 | 		/sys/kernel/debug/tracing/events/block/block_unplug/trigger | 
 |  | 
 |   To always disable tracing when nr_rq  > 1:: | 
 |  | 
 | 	  # echo 'traceoff if nr_rq > 1' > \ | 
 | 		/sys/kernel/debug/tracing/events/block/block_unplug/trigger | 
 |  | 
 |   To remove the above commands:: | 
 |  | 
 | 	  # echo '!traceoff:1 if nr_rq > 1' > \ | 
 | 		/sys/kernel/debug/tracing/events/block/block_unplug/trigger | 
 |  | 
 | 	  # echo '!traceoff if nr_rq > 1' > \ | 
 | 		/sys/kernel/debug/tracing/events/block/block_unplug/trigger | 
 |  | 
 |   Note that there can be only one traceon or traceoff trigger per | 
 |   triggering event. | 
 |  | 
 | - hist | 
 |  | 
 |   This command aggregates event hits into a hash table keyed on one or | 
 |   more trace event format fields (or stacktrace) and a set of running | 
 |   totals derived from one or more trace event format fields and/or | 
 |   event counts (hitcount). | 
 |  | 
 |   See Documentation/trace/histogram.rst for details and examples. |