| ############### | 
 | Timerlat tracer | 
 | ############### | 
 |  | 
 | The timerlat tracer helps developers of preemptive kernels find the | 
 | sources of wakeup latency of real-time threads. Like cyclictest, the | 
 | tracer sets a periodic timer that wakes up a thread. The thread then | 
 | computes a *wakeup latency* value as the difference between the *current | 
 | time* and the *absolute time* that the timer was set to expire. The main | 
 | goal of timerlat is to trace this latency in a way that helps kernel | 
 | developers find its cause. | 
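
The latency computation described above can be sketched in C as follows. This is only an illustration of the arithmetic, not the tracer's in-kernel code; the helper names ``timespec_to_ns`` and ``wakeup_latency_ns`` are hypothetical.

```c
#include <stdint.h>
#include <time.h>

/* Convert a timespec to a single nanosecond count. */
static int64_t timespec_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Wakeup latency: the current time minus the absolute time at which
 * the timer was set to expire (hypothetical helper, for illustration).
 */
static int64_t wakeup_latency_ns(const struct timespec *now,
				 const struct timespec *expected)
{
	return timespec_to_ns(now) - timespec_to_ns(expected);
}
```

For instance, a timer programmed for 54.999999500 s and observed at 55.000000432 s yields a latency of 932 ns.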
 |  | 
 | Usage | 
 | ----- | 
 |  | 
 | Write the ASCII text "timerlat" into the current_tracer file of the | 
 | tracing system (generally mounted at /sys/kernel/tracing). | 
 |  | 
 | For example:: | 
 |  | 
 |         [root@f32 ~]# cd /sys/kernel/tracing/ | 
 |         [root@f32 tracing]# echo timerlat > current_tracer | 
 |  | 
 | It is possible to follow the trace by reading the trace file:: | 
 |  | 
 |   [root@f32 tracing]# cat trace | 
 |   # tracer: timerlat | 
 |   # | 
 |   #                              _-----=> irqs-off | 
 |   #                             / _----=> need-resched | 
 |   #                            | / _---=> hardirq/softirq | 
 |   #                            || / _--=> preempt-depth | 
 |   #                            || / | 
 |   #                            ||||             ACTIVATION | 
 |   #         TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY | 
 |   #            | |         |   ||||      |         |                  |                       | | 
 |           <idle>-0       [000] d.h1    54.029328: #1     context    irq timer_latency       932 ns | 
 |            <...>-867     [000] ....    54.029339: #1     context thread timer_latency     11700 ns | 
 |           <idle>-0       [001] dNh1    54.029346: #1     context    irq timer_latency      2833 ns | 
 |            <...>-868     [001] ....    54.029353: #1     context thread timer_latency      9820 ns | 
 |           <idle>-0       [000] d.h1    54.030328: #2     context    irq timer_latency       769 ns | 
 |            <...>-867     [000] ....    54.030330: #2     context thread timer_latency      3070 ns | 
 |           <idle>-0       [001] d.h1    54.030344: #2     context    irq timer_latency       935 ns | 
 |            <...>-868     [001] ....    54.030347: #2     context thread timer_latency      4351 ns | 
 |  | 
 |  | 
 | The tracer creates a per-cpu kernel thread with real-time priority that | 
 | prints two lines at every activation. The first is the *timer latency* | 
 | observed at the *hardirq* context before the activation of the thread. | 
 | The second is the *timer latency* observed by the thread. The ACTIVATION | 
 | ID field serves to relate the *irq* execution to its respective *thread* | 
 | execution. | 
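
To relate the two lines of an activation programmatically, a tool can extract the ACTIVATION ID, the context, and the latency from each line. The sketch below assumes the text layout shown in the trace output above; the struct and the function ``parse_timerlat_line`` are hypothetical names, not part of any kernel or library API.

```c
#include <stdio.h>
#include <string.h>

/* One timerlat sample as printed in the trace file (hypothetical type). */
struct timerlat_sample {
	unsigned long id;		/* ACTIVATION ID */
	char context[8];		/* "irq" or "thread" */
	unsigned long long latency_ns;	/* LATENCY column */
};

/*
 * Parse a line such as:
 *   <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
 * Returns 0 on success, -1 if the line is not a timerlat sample.
 */
static int parse_timerlat_line(const char *line, struct timerlat_sample *s)
{
	/* The sample payload starts right after the "TIMESTAMP: " field. */
	const char *p = strstr(line, ": #");

	if (!p)
		return -1;

	if (sscanf(p + 2, "#%lu context %7s timer_latency %llu ns",
		   &s->id, s->context, &s->latency_ns) != 3)
		return -1;

	return 0;
}
```

Two samples with the same ``id`` on the same CPU, one with context ``irq`` and one with context ``thread``, belong to the same activation.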
 |  | 
 | Splitting the report into *irq* and *thread* values is important to | 
 | clarify which context an unexpectedly high value comes from. The *irq* | 
 | context can be delayed by hardware-related actions, such as SMIs, NMIs, | 
 | and IRQs, or by threads masking interrupts. Once the timer fires, the | 
 | delay can also be influenced by blocking caused by threads, for example | 
 | by postponing the scheduler execution via preempt_disable(), by the | 
 | scheduler execution itself, or by masking interrupts. Threads can also | 
 | be delayed by interference from other threads and IRQs. | 
 |  | 
 | Tracer options | 
 | --------------------- | 
 |  | 
 | The timerlat tracer is built on top of the osnoise tracer, so its | 
 | configuration is also done in the osnoise/ config directory. The | 
 | timerlat configs are: | 
 |  | 
 |  - cpus: CPUs at which a timerlat thread will execute. | 
 |  - timerlat_period_us: the period of the timerlat thread, in | 
 |    microseconds. | 
 |  - stop_tracing_us: stop the system tracing if a timer latency at the | 
 |    *irq* context is higher than the configured value. Writing 0 | 
 |    disables this option. | 
 |  - stop_tracing_total_us: stop the system tracing if a timer latency at | 
 |    the *thread* context is higher than the configured value. Writing 0 | 
 |    disables this option. | 
 |  - print_stack: save the stack of the IRQ occurrence. The stack is printed | 
 |    after the *thread context* event, or at the IRQ handler if *stop_tracing_us* | 
 |    is hit. | 
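
These options are plain files, so a program can set them the same way the shell examples do with echo. The following is a minimal sketch, assuming the option takes a single integer value (which is not the case for cpus, a CPU list); the function name ``timerlat_set_option`` is hypothetical.

```c
#include <stdio.h>

/*
 * Write an integer value to <tracing_dir>/osnoise/<option>, e.g.
 * /sys/kernel/tracing/osnoise/timerlat_period_us.
 * Returns 0 on success, -1 on failure (hypothetical helper).
 */
static int timerlat_set_option(const char *tracing_dir, const char *option,
			       long value)
{
	char path[512];
	FILE *f;
	int n, ret = 0;

	n = snprintf(path, sizeof(path), "%s/osnoise/%s", tracing_dir, option);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fprintf(f, "%ld\n", value) < 0)
		ret = -1;

	/* fclose() flushes the buffer, so a rejected write shows up here. */
	if (fclose(f) != 0)
		ret = -1;

	return ret;
}
```

For example, ``timerlat_set_option("/sys/kernel/tracing", "stop_tracing_total_us", 25)`` would mirror the echo command used in the next section.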
 |  | 
 | timerlat and osnoise | 
 | ---------------------------- | 
 |  | 
 | The timerlat tracer can also take advantage of the osnoise: trace events. | 
 | For example:: | 
 |  | 
 |         [root@f32 ~]# cd /sys/kernel/tracing/ | 
 |         [root@f32 tracing]# echo timerlat > current_tracer | 
 |         [root@f32 tracing]# echo 1 > events/osnoise/enable | 
 |         [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us | 
 |         [root@f32 tracing]# tail -10 trace | 
 |              cc1-87882   [005] d..h...   548.771078: #402268 context    irq timer_latency     13585 ns | 
 |              cc1-87882   [005] dNLh1..   548.771082: irq_noise: local_timer:236 start 548.771077442 duration 7597 ns | 
 |              cc1-87882   [005] dNLh2..   548.771099: irq_noise: qxl:21 start 548.771085017 duration 7139 ns | 
 |              cc1-87882   [005] d...3..   548.771102: thread_noise:      cc1:87882 start 548.771078243 duration 9909 ns | 
 |       timerlat/5-1035    [005] .......   548.771104: #402268 context thread timer_latency     39960 ns | 
 |  | 
 | In this case, the timer latency does not have a single root cause but | 
 | several. First, the timer IRQ was delayed for about 13 us, which may | 
 | point to a long section with IRQs disabled (see the IRQ | 
 | stacktrace section). Then the timer interrupt that wakes up the timerlat | 
 | thread took 7597 ns, and the qxl:21 device IRQ took 7139 ns. Finally, | 
 | the cc1 thread noise took 9909 ns of time before the context switch. | 
 | Such pieces of evidence are useful for the developer to use other | 
 | tracing methods to figure out how to debug and optimize the system. | 
 |  | 
 | It is worth mentioning that the *duration* values reported | 
 | by the osnoise: events are *net* values. For example, the | 
 | thread_noise does not include the duration of the overhead caused | 
 | by the IRQ execution (which indeed accounted for 14736 ns, the sum of | 
 | the two irq_noise durations). But the values reported by the timerlat | 
 | tracer (timer_latency) are *gross* values. | 
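
Because the osnoise: durations are net, they can be added up and compared against the gross thread latency. The sketch below does this for the trace above, under the simplifying assumption that the IRQ latency and the noise events do not overlap; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the net durations reported by osnoise: events, in nanoseconds. */
static uint64_t total_noise_ns(const uint64_t *durations, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += durations[i];

	return sum;
}

/*
 * Part of the gross thread latency that is explained neither by the
 * IRQ latency nor by the net noise events (hypothetical helper).
 */
static int64_t unattributed_ns(uint64_t thread_latency_ns,
			       uint64_t irq_latency_ns,
			       const uint64_t *noise, size_t n)
{
	return (int64_t)thread_latency_ns - (int64_t)irq_latency_ns -
	       (int64_t)total_noise_ns(noise, n);
}
```

With the values from the trace (noise of 7597, 7139 and 9909 ns, IRQ latency of 13585 ns, thread latency of 39960 ns), the noise adds up to 24645 ns and 1730 ns remain unattributed to any event.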
 |  | 
 | The art below illustrates a CPU timeline and how the timerlat tracer | 
 | observes it at the top and the osnoise: events at the bottom. Each "-" | 
 | in the timelines means circa 1 us, and the time moves ==>:: | 
 |  | 
 |       External     timer irq                   thread | 
 |        clock        latency                    latency | 
 |        event        13585 ns                   39960 ns | 
 |          |             ^                         ^ | 
 |          v             |                         | | 
 |          |-------------|                         | | 
 |          |-------------+-------------------------| | 
 |                        ^                         ^ | 
 |   ======================================================================== | 
 |                     [tmr irq]  [dev irq] | 
 |   [another thread...^       v..^       v.......][timerlat/ thread]  <-- CPU timeline | 
 |   ========================================================================= | 
 |                     |-------|  |-------| | 
 |                             |--^       v-------| | 
 |                             |          |       | | 
 |                             |          |       + thread_noise: 9909 ns | 
 |                             |          +-> irq_noise: 7139 ns | 
 |                             +-> irq_noise: 7597 ns | 
 |  | 
 | IRQ stacktrace | 
 | --------------------------- | 
 |  | 
 | The osnoise/print_stack option is helpful when thread noise is the | 
 | major factor in the timer latency because preemption or IRQs were | 
 | disabled. For example:: | 
 |  | 
 |         [root@f32 tracing]# echo 500 > osnoise/stop_tracing_total_us | 
 |         [root@f32 tracing]# echo 500 > osnoise/print_stack | 
 |         [root@f32 tracing]# echo timerlat > current_tracer | 
 |         [root@f32 tracing]# tail -21 per_cpu/cpu7/trace | 
 |           insmod-1026    [007] dN.h1..   200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns | 
 |           insmod-1026    [007] d..h1..   200.202587: #29800 context    irq timer_latency      1616 ns | 
 |           insmod-1026    [007] dN.h2..   200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns | 
 |           insmod-1026    [007] dN.h3..   200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns | 
 |           insmod-1026    [007] d...3..   200.203444: thread_noise:   insmod:1026 start 200.202586933 duration 838681 ns | 
 |       timerlat/7-1001    [007] .......   200.203445: #29800 context thread timer_latency    859978 ns | 
 |       timerlat/7-1001    [007] ....1..   200.203446: <stack trace> | 
 |   => timerlat_irq | 
 |   => __hrtimer_run_queues | 
 |   => hrtimer_interrupt | 
 |   => __sysvec_apic_timer_interrupt | 
 |   => asm_call_irq_on_stack | 
 |   => sysvec_apic_timer_interrupt | 
 |   => asm_sysvec_apic_timer_interrupt | 
 |   => delay_tsc | 
 |   => dummy_load_1ms_pd_init | 
 |   => do_one_initcall | 
 |   => do_init_module | 
 |   => __do_sys_finit_module | 
 |   => do_syscall_64 | 
 |   => entry_SYSCALL_64_after_hwframe | 
 |  | 
 | In this case, it is possible to see that the thread added the highest | 
 | contribution to the *timer latency* and the stack trace, saved during | 
 | the timerlat IRQ handler, points to a function named | 
 | dummy_load_1ms_pd_init, which had the following code (on purpose):: | 
 |  | 
 | 	static int __init dummy_load_1ms_pd_init(void) | 
 | 	{ | 
 | 		preempt_disable(); | 
 | 		mdelay(1); | 
 | 		preempt_enable(); | 
 | 		return 0; | 
 | 	} | 
 |  | 
 | User-space interface | 
 | --------------------------- | 
 |  | 
 | Timerlat allows user-space threads to use the timerlat infrastructure to | 
 | measure scheduling latency. This interface is accessible via a per-CPU | 
 | file descriptor inside $tracing_dir/osnoise/per_cpu/cpu$ID/timerlat_fd. | 
 |  | 
 | This interface is accessible under the following conditions: | 
 |  | 
 |  - timerlat tracer is enabled | 
 |  - osnoise workload option is set to NO_OSNOISE_WORKLOAD | 
 |  - The user-space thread is affined to a single processor | 
 |  - The thread opens the file associated with its single processor | 
 |  - Only one thread can access the file at a time | 
 |  | 
 | The open() syscall will fail if any of these conditions are not met. | 
 | After opening the file descriptor, the user space can read from it. | 
 |  | 
 | The read() system call runs timerlat code that arms the timer in the | 
 | future and waits for it, as the regular kernel thread does. | 
 |  | 
 | When the timer IRQ fires, the timerlat IRQ handler executes, reports the | 
 | IRQ latency, and wakes up the thread waiting in the read. The thread is | 
 | then scheduled and reports the thread latency via the tracer, as the | 
 | kernel thread does. | 
 |  | 
 | The difference from the in-kernel timerlat is that, instead of re-arming | 
 | the timer, timerlat will return to the read() system call. At this point, | 
 | the user can run any code. | 
 |  | 
 | If the application reads the timerlat file descriptor again, the tracer | 
 | will report the return from user-space latency, which is the total | 
 | latency. If this is the end of the work, it can be interpreted as the | 
 | response time for the request. | 
 |  | 
 | After reporting the total latency, timerlat will restart the cycle, arm | 
 | a timer, and go to sleep for the following activation. | 
 |  | 
 | If at any time one of the conditions is broken, e.g., the thread migrates | 
 | while in user space, or the timerlat tracer is disabled, the SIGKILL | 
 | signal will be sent to the user-space thread. | 
 |  | 
 | Here is a basic example of user-space code for timerlat:: | 
 |  | 
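 |  #define _GNU_SOURCE	/* for CPU_SET(), sched_setaffinity(), gettid() */ | 
 |  #include <errno.h> | 
 |  #include <fcntl.h> | 
 |  #include <sched.h> | 
 |  #include <stdio.h> | 
 |  #include <stdlib.h> | 
 |  #include <string.h> | 
 |  #include <unistd.h> | 
 |  | 
 |  int main(void) | 
 |  { | 
 | 	char buffer[1024]; | 
 | 	int timerlat_fd; | 
 | 	int retval; | 
 | 	long cpu = 0;   /* place in CPU 0 */ | 
 | 	cpu_set_t set; | 
 |  | 
 | 	/* The thread must be affined to a single processor. */ | 
 | 	CPU_ZERO(&set); | 
 | 	CPU_SET(cpu, &set); | 
 |  | 
 | 	if (sched_setaffinity(gettid(), sizeof(set), &set) == -1) | 
 | 		return 1; | 
 |  | 
 | 	/* Open the timerlat_fd file of that same processor. */ | 
 | 	snprintf(buffer, sizeof(buffer), | 
 | 		"/sys/kernel/tracing/osnoise/per_cpu/cpu%ld/timerlat_fd", | 
 | 		cpu); | 
 |  | 
 | 	timerlat_fd = open(buffer, O_RDONLY); | 
 | 	if (timerlat_fd < 0) { | 
 | 		printf("error opening %s: %s\n", buffer, strerror(errno)); | 
 | 		exit(1); | 
 | 	} | 
 |  | 
 | 	/* Each read() waits for the next timer activation. */ | 
 | 	for (;;) { | 
 | 		retval = read(timerlat_fd, buffer, sizeof(buffer)); | 
 | 		if (retval < 0) | 
 | 			break; | 
 | 	} | 
 |  | 
 | 	close(timerlat_fd); | 
 | 	exit(0); | 
 |  } | 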