blob: ff5da3e94ac64305b52e5de9251e5dfdb826d0e8 [file] [log] [blame]
=====================
Function based events
=====================
.. Copyright 2018 VMware Inc.
.. Author: Steven Rostedt <srostedt@goodmis.org>
.. License: The GNU Free Documentation License, Version 1.2
.. (dual licensed under the GPL v2)
Introduction
============
Static events are extremely useful for analyzing the happenings of
inside the Linux kernel. But there are times where events are not
available, either due to not being in control of the kernel, or simply
because a maintainer refuses to have them in their subsystem.
The function tracer is a way trace within a subsystem without trace events.
But it only provides information of when a function was hit and who
called it. Combining trace events with the function tracer allows
for dynamically creating trace events where they do not exist at
function entry. They provide more information than the function
tracer can provide, as they can read the parameters of a function
or simply read an address. This makes it possible to create a
trace point at any function that the function tracer can trace, and
read the parameters of the function.
[ Quick start guide, see tips and tricks at the bottom ]
Usage
=====
Simply writing an ASCII string into a file called "function_events"
in the tracefs file system will create the function based events.
Note, this file is only writable by root.
# mount -t tracefs nodev /sys/kernel/tracing
# cd /sys/kernel/tracing
# echo 'do_IRQ()' > function_events
The above will create a trace event on the do_IRQ function call.
As no parameters were specified, it will not trace anything other
than the function and the parent. This is the minimum function
based event.
# ls events/functions/do_IRQ
enable filter format hist id trigger
Even though the above function based event does not record much more
than the function tracer does, it does become a full fledge event.
This can be used by the histogram infrastructure, triggers, and perf
where one can attach eBPF programs to.
# cat events/functions/do_IRQ/format
name: do_IRQ
ID: 1304
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long __parent_ip; offset:8; size:8; signed:0;
field:unsigned long __ip; offset:16; size:8; signed:0;
print fmt: "%pS->%pS()", REC->__ip, REC->__parent_ip
The above shows that the format is very close to the function trace
except that it displays the parent function followed by the called
function.
Number of arguments
===================
The number of arguments that can be specified is dependent on the
architecture. An architecture may not allow any arguments, or it
may limit to just three or six. If more arguments are used than
supported, it will fail with -EINVAL.
Parameters
==========
Adding parameters creates fields within the events. The format is
as follows:
# echo EVENT > function_events
EVENT := <function> '(' ARGS ')'
Where <function> is any function that the function tracer can trace.
ARGS := ARG | ARG ',' ARGS | ''
ARG := TYPE FIELD | TYPE <name> '=' ADDR | TYPE ADDR | ARG '|' ARG | 'NULL'
TYPE := ATOM | ATOM '[' <number> ']' | 'unsigned' TYPE
ATOM := 'u8' | 'u16' | 'u32' | 'u64' |
's8' | 's16' | 's32' | 's64' |
'x8' | 'x16' | 'x32' | 'x64' |
'char' | 'short' | 'int' | 'long' | 'size_t' |
'symbol' | 'string'
FIELD := <name> | <name> INDEX | <name> OFFSET | <name> OFFSET INDEX |
FIELD INDIRECT
INDEX := '[' <number> ']'
OFFSET := '+' <number>
INDIRECT := INDEX | OFFSET | INDIRECT INDIRECT | ''
ADDR := A hexidecimal address starting with '0x'
Where <name> is a unique string starting with an alphabetic character
and consists only of letters and numbers and underscores.
Where <number> is a number that can be read by kstrtol() (hex, decimal, etc).
Simple arguments
================
Looking at kernel code, we can see something like:
v4.15: net/ipv4/ip_input.c:
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
If we are only interested in the first argument (skb):
# echo 'ip_rcv(x64 skb, x86 dev)' > function_events
# echo 1 > events/functions/ip_rcv/enable
# cat trace
<idle>-0 [003] ..s3 5543.133460: __netif_receive_skb_core->ip_rcv(skb=ffff88007f960700, net=ffff880114250000)
<idle>-0 [003] ..s3 5543.133475: __netif_receive_skb_core->ip_rcv(skb=ffff88007f960700, net=ffff880114250000)
<idle>-0 [003] ..s3 5543.312592: __netif_receive_skb_core->ip_rcv(skb=ffff88007f960700, net=ffff880114250000)
<idle>-0 [003] ..s3 5543.313150: __netif_receive_skb_core->ip_rcv(skb=ffff88007f960700, net=ffff880114250000)
We use "x64" in order to make sure that the data is displayed in hex.
This is on a x86_64 machine, and we know the pointer sizes are 8 bytes.
Indexing
========
The pointers of the skb and the dev isn't that interesting. But if we want the
length "len" field of skb, we could index it with an index operator '[' and ']'.
Using gdb, we can find the offset of 'len' from the sk_buff type:
$ gdb vmlinux
(gdb) printf "%d\n", &((struct sk_buff *)0)->len
128
As 128 / 4 (length of int) is 32, we can see the length of the skb with:
# echo 'ip_rcv(int skb[32], x64 dev)' > function_events
# echo 1 > events/functions/ip_rcv/enable
# cat trace
<idle>-0 [003] ..s3 280.167137: __netif_receive_skb_core->ip_rcv(skb=52, dev=ffff8801092f9400)
<idle>-0 [003] ..s3 280.167152: __netif_receive_skb_core->ip_rcv(skb=52, dev=ffff8801092f9400)
<idle>-0 [003] ..s3 280.806629: __netif_receive_skb_core->ip_rcv(skb=88, dev=ffff8801092f9400)
<idle>-0 [003] ..s3 280.807023: __netif_receive_skb_core->ip_rcv(skb=52, dev=ffff8801092f9400)
Now we see the length of the sk_buff per event.
Note: You can also use the tool "pahole" from the "dwarves" package:
$ pahole ./vmlinux -C sk_buf | grep len
unsigned int len; /* 128 4 */
Which gives you the "128" index from above. It also gives you the size
of the field which could be useful for determining the type.
Multiple fields per argument
============================
If we still want to see the skb pointer value along with the length of the
skb, then using the '|' option allows us to add more than one option to
an argument:
# echo 'ip_rcv(x64 skb | int skb[32], x64 dev)' > function_events
# echo 1 > events/functions/ip_rcv/enable
# cat trace
<idle>-0 [003] ..s3 904.075838: __netif_receive_skb_core->ip_rcv(skb=ffff88011396e800, skb=52, dev=ffff880115204000)
<idle>-0 [003] ..s3 904.075848: __netif_receive_skb_core->ip_rcv(skb=ffff88011396e800, skb=52, dev=ffff880115204000)
<idle>-0 [003] ..s3 904.725486: __netif_receive_skb_core->ip_rcv(skb=ffff88011396e800, skb=194, dev=ffff880115204000)
<idle>-0 [003] ..s3 905.152537: __netif_receive_skb_core->ip_rcv(skb=ffff88011396f200, skb=88, dev=ffff880115204000)
Unsigned usage
==============
One can also use "unsigned" to make some types unsigned. It works against
"long", "int", "short" and "char". It doesn't error against other types but
may not make any sense.
# echo 'ip_rcv(int skb[32])' > function_events
# cat events/functions/ip_rcv/format
name: ip_rcv
ID: 1397
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long __parent_ip; offset:8; size:8; signed:0;
field:unsigned long __ip; offset:16; size:8; signed:0;
field:int skb; offset:24; size:4; signed:1;
print fmt: "%pS->%pS(skb=%d)", REC->__ip, REC->__parent_ip, REC->skb
Notice that REC->skb is printed with "%d". By adding "unsigned"
# echo 'ip_rcv(unsigned int skb[32])' > function_events
# cat events/functions/ip_rcv/format
name: ip_rcv
ID: 1398
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long __parent_ip; offset:8; size:8; signed:0;
field:unsigned long __ip; offset:16; size:8; signed:0;
field:unsigned int skb; offset:24; size:4; signed:0;
print fmt: "%pS->%pS(skb=%u)", REC->__ip, REC->__parent_ip, REC->skb
It is now printed with a "%u".
Offsets
=======
After the name of the variable, brackets '[' number ']' will index the value of
the argument by the number given times the size of the field.
int field[5] will dereference the value of the argument 20 bytes away (4 * 5)
as sizeof(int) is 4.
If there's a case where the type is of 8 bytes in size but is not 8 bytes
alligned in the structure, an offset may be required.
For example: x64 param+4[2]
The above will take the parameter value, add it by 4, then index it by two
8 byte words. It's the same in C as: (u64 *)((void *)param + 4)[2]
Note: "int skb[32]" is the same as "int skb+4[31]".
Symbols (function names)
========================
To display kallsyms "%pS" type of output, use the special type "symbol".
Again, using gdb to find the offset of the "func" field of struct work_struct
(gdb) printf "%d\n", &((struct work_struct *)0)->func
24
Both "symbol func[3]" and "symbol func+24[0]" will work.
# echo '__queue_work(int cpu, x64 wq, symbol func[3])' > function_events
# echo 1 > events/functions/__queue_work/enable
# cat trace
bash-1641 [007] d..2 6241.171332: queue_work_on->__queue_work(cpu=128, wq=ffff88011a010e00, func=flush_to_ldisc+0x0/0xa0)
bash-1641 [007] d..2 6241.171460: queue_work_on->__queue_work(cpu=128, wq=ffff88011a010e00, func=flush_to_ldisc+0x0/0xa0)
<idle>-0 [000] dNs3 6241.172004: delayed_work_timer_fn->__queue_work(cpu=128, wq=ffff88011a010800, func=vmstat_shepherd+0x0/0xb0)
worker/0:2-1689 [000] d..2 6241.172026: __queue_delayed_work->__queue_work(cpu=7, wq=ffff88011a11da00, func=vmstat_update+0x0/0x70)
<idle>-0 [005] d.s3 6241.347996: queue_work_on->__queue_work(cpu=128, wq=ffff88011a011200, func=fb_flashcursor+0x0/0x110 [fb])
Direct memory access
====================
Function arguments are not the only thing that can be recorded from a function
based event. Memory addresses can also be examined. If there's a global variable
that you want to monitor via an interrupt, you can put in the address directly.
# grep total_forks /proc/kallsyms
ffffffff82354c18 B total_forks
# echo 'do_IRQ(int total_forks=0xffffffff82354c18)' > function_events
# echo 1 events/functions/do_IRQ/enable
# cat trace
<idle>-0 [003] d..3 337.076709: ret_from_intr->do_IRQ(total_forks=1419)
<idle>-0 [003] d..3 337.077046: ret_from_intr->do_IRQ(total_forks=1419)
<idle>-0 [003] d..3 337.077076: ret_from_intr->do_IRQ(total_forks=1420)
Note, address notations do not affect the argument count. For instance, with
__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
# echo 'do_IRQ(int total_forks=0xffffffff82354c18, symbol regs[16])' > function_events
Is the same as
# echo 'do_IRQ(int total_forks=0xffffffff82354c18 | symbol regs[16])' > function_events
# cat trace
<idle>-0 [003] d..3 653.839546: ret_from_intr->do_IRQ(total_forks=1504, regs=cpuidle_enter_state+0xb1/0x330)
<idle>-0 [003] d..3 653.906011: ret_from_intr->do_IRQ(total_forks=1504, regs=cpuidle_enter_state+0xb1/0x330)
<idle>-0 [003] d..3 655.823498: ret_from_intr->do_IRQ(total_forks=1504, regs=tick_nohz_idle_enter+0x4c/0x50)
<idle>-0 [003] d..3 655.954096: ret_from_intr->do_IRQ(total_forks=1504, regs=cpuidle_enter_state+0xb1/0x330)
Array types
===========
If there's a case where you want to see an array of a type, then you can
declare a type as an array by adding '[' number ']' after the type.
To get the net_device perm_addr, from the dev parameter.
(gdb) printf "%d\n", &((struct net_device *)0)->perm_addr
558
# echo 'ip_rcv(x64 skb, x8[6] perm_addr+558)' > function_events
# echo 1 > events/functions/ip_rcv/enable
# cat trace
<idle>-0 [003] ..s3 219.813582: __netif_receive_skb_core->ip_rcv(skb=ffff880118195e00, perm_addr=b4,b5,2f,ce,18,65)
<idle>-0 [003] ..s3 219.813595: __netif_receive_skb_core->ip_rcv(skb=ffff880118195e00, perm_addr=b4,b5,2f,ce,18,65)
<idle>-0 [003] ..s3 220.115053: __netif_receive_skb_core->ip_rcv(skb=ffff880118195c00, perm_addr=b4,b5,2f,ce,18,65)
<idle>-0 [003] ..s3 220.115293: __netif_receive_skb_core->ip_rcv(skb=ffff880118195c00, perm_addr=b4,b5,2f,ce,18,65)
Static strings
==============
An array of type 'char' or 'unsigned char' will be processed as a string using
the format "%s". If a nul is found, the output will stop. Use another type
(x8, u8, s8) if this is not desired.
# echo 'link_path_walk(char[64] name)' > function_events
# echo 1 > events/functions/link_path_walk/enable
# cat trace
bash-1470 [003] ...2 980.678664: path_openat->link_path_walk(name=/usr/bin/cat)
bash-1470 [003] ...2 980.678715: path_openat->link_path_walk(name=/lib64/ld-linux-x86-64.so.2)
bash-1470 [003] ...2 980.678721: path_openat->link_path_walk(name=ld-2.24.so)
bash-1470 [003] ...2 980.678978: path_lookupat->link_path_walk(name=/etc/ld.so.preload)
Dynamic strings
===============
Static strings are fine, but they can waste a lot of memory in the ring buffer.
The above allocated 64 bytes for a character array, but most of the output was
less than 20 characters. Not wanting to truncate strings or waste space on
the ring buffer, the dynamic string can help.
Use the "string" type for strings that have a large range in size. The max
size that will be recorded is 512 bytes. If a string is larger than that, then
it will be truncated.
# echo 'link_path_walk(string name)' > function_events
Gives the same result as above, but does not waste buffer space.
NULL arguments
==============
If you are only interested in the second, or later parameter of a function,
you do not have to record the previous parameters. Just set them as NULL and
they will not be recorded.
If we only wanted the perm_addr of the net_device of ip_rcv() and not the
sk_buff, we put a NULL into the first parameter when created the function
based event.
# echo 'ip_rcv(NULL, x8[6] perm_addr+558)' > function_events
# echo 1 > events/functions/ip_rcv/enable
# cat trace
<idle>-0 [003] ..s3 165.617114: __netif_receive_skb_core->ip_rcv(perm_addr=b4,b5,2f,ce,18,65)
<idle>-0 [003] ..s3 165.617133: __netif_receive_skb_core->ip_rcv(perm_addr=b4,b5,2f,ce,18,65)
<idle>-0 [003] ..s3 166.412277: __netif_receive_skb_core->ip_rcv(perm_addr=b4,b5,2f,ce,18,65)
<idle>-0 [003] ..s3 166.412797: __netif_receive_skb_core->ip_rcv(perm_addr=b4,b5,2f,ce,18,65)
NULL can appear in any argument, to have them ignored. Note, skipping arguments
does not give you access to later arguments if they are not supported by the
architecture. The architecture only supplies the first set of arguments.
The chain of indirects
======================
When a parameter is a structure, and that structure points to another structure,
the data of that structure can still be found.
ssize_t __vfs_read(struct file *file, char __user *buf, size_t count,
loff_t *pos)
has the following code.
if (file->f_op->read)
return file->f_op->read(file, buf, count, pos);
To trace all the functions that are called by f_op->read(), that information
can be obtained from the file pointer.
Using gdb again:
(gdb) printf "%d\n", &((struct file *)0)->f_op
40
(gdb) printf "%d\n", &((struct file_operations *)0)->read
16
# echo '__vfs_read(symbol read+40[0]+16)' > function_events
# echo 1 > events/functions/__vfs_read/enable
# cat trace
sshd-1343 [005] ...2 199.734752: vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
bash-1344 [003] ...2 199.734822: vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
sshd-1343 [005] ...2 199.734835: vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
avahi-daemon-910 [003] ...2 200.136740: vfs_read->__vfs_read(read= (null))
avahi-daemon-910 [003] ...2 200.136750: vfs_read->__vfs_read(read= (null))
Or to go a bit more extreme: To get the contents of a system type name
from a file: struct file->f_inode->s_sb->s_type->name
(gdb) printf "%dn", &((struct file *)0)->f_inode
32
(gdb) printf "%d\n", &((struct inode *)0)->i_sb
40
(gdb) printf "%d\n", &((struct super_block *)0)->s_type
40
Since a string does not have a common size, use of offsets must be used,
and only use a zero indirect ([0]).
# echo '__vfs_read(string name+32[0]+40[0]+40[0][0])' > function_events
To break the above down. The first argument passed to __vfs_read() is
a pointer to a "struct file".
'__vfs_read(x64 file)' Would return the address of the file.
'__vfs_read(x64 inode+32[0])' returns the address of the inode indexed in file
'__vfs_read(x64 sb+32[0]+40[0])' returns the address of the super block indexed
from the inode.
'__vfs_read(x64 stype+32[0]+40[0]+40[0])' returns the address of the
file system type, indexed from the super block.
'__vfs_read(x64 name+32[0]+40[0]+40[0][0]' returns the address of name, indexed
from the file system type.
The 'string' type requires the address of the string, where the above produces:
sshd-806 [000] ...2 939615.584601: vfs_read->__vfs_read(name=devtmpfs)
sshd-806 [000] ...2 939615.585328: vfs_read->__vfs_read(name=devtmpfs)
bash-807 [000] ...2 939615.585832: vfs_read->__vfs_read(name=devpts)
sshd-806 [000] ...2 939617.206237: vfs_read->__vfs_read(name=sockfs)
sshd-806 [000] ...2 939617.207103: vfs_read->__vfs_read(name=devtmpfs)
Tips and Tricks
===============
Simple args are:
"int val", "unsigned int val", "char x", "unsigned long addr",
You can also use "s#" and "u#" types:
"s32 val", "u32 val", "s8 x", "u64 addr"
The above are all printed in decimal "%d" or "%u", if you want hex...
"x32 val", "x8 x", "x64 addr"
If you want to have it use "%pS" to print (symbols)
"symbol addr" is like: "%pS", (void *)addr
Arrays can be expressed after the type:
"x8[6] mac" is like: "{%x:%x:%x:%x:%x:%x}", mac[0], mac[1], mac[2],
mac[3], mac[4], mac[5]
Where mac would be: unsigned char mac[6] type.
Note, arrays of type "char" and "unsigned char" turn into a static
string.
"char[10] str" is like: "%s", str
Where str is defined as char str[10];
If the argument is a pointer to a structure, you can index into the
structure:
"x64 ip[16]" is like: "%llx", ((u64 *)ip)[16]
Finally, if an argument is a pointer to a structure, and you want to
get to another structure that it points to, for example
struct sk_buff *skb;
and you want to get to:
skb->dev->perm_addr
when the parameter is a pointer to skb.
(gdb) printf "%d\n", &((struct sk_buff *)0)->dev
16
(gdb) printf "%d\n", &((struct net_device *)0)->perm_addr
558
The net_device *dev is 16 bytes into sk_buff, and the char array
perm_addr, is 558 bytes into the net_device structure.
Where perm_addr is an array of the mac address.
"x8[6] perm_addr+16[0]+558"
The "x8[6]" is to define the printed type to be an array of 6 u8 types
and will be printed as hex.
"perm_addr" defines the field name to assing the value for.
"+16" will take the parameter value and add 16 to it, similar
to "((void *)perm_addr)+16".
The "[0]" will read what is in the current location (perm_addr + 16).
The "+558" will take the value read from the location above, and
then add 558 to it.
Since the final value is an array, no indirection is needed.
"x8[1] perm_addr+16[0]+558" would yield the same as
"x8 perm_addr+16[0]+558[0]", except the former would add brackets.
Basically, "x8[6] perm_addr+16[0]+558" is equivalent to:
char *dev = (char **)(((void *)skb)+16)[0];
char *perm_addr = (char *)(dev+558);
"{%x:%x:%x:%x:%x:%x}", perm_addr[0], perm_addr[1], perm_addr[2],
perm_addr[3], perm_addr[4], perm_addr[5]
OK, the above is a bit complex ;-) But works nicely.