| Evaluating Realtime Linux system performance with rteval |
| Clark Williams <williams@redhat.com> |
| -------------------------------------------------------- |
| |
| Abstract |
| -------- |
| |
One of the problems in developing and fielding any software product
that runs on a wide range of hardware platforms is determining how
well the product performs on those platforms. This is especially true
when developing something as closely tied to hardware as a Realtime
Linux system. So how do you measure the performance of a realtime
Linux kernel on a particular hardware platform? What defines "good
performance" for a realtime system?
| |
A realtime system is one which must meet deadlines of some
sort. These deadlines may be periodic (e.g. occurring every 200
milliseconds) or they may be a time limit following the occurrence of
an event (e.g. no more than 100 milliseconds following the arrival of
a network packet). To give a realtime application the best chance of
meeting its deadline(s), a realtime OS must minimize the time between
the occurrence of an event and the servicing of that event (the
latency).
| |
This paper describes 'rteval', a Python 3.x program developed at Red
Hat to help quantify realtime performance on the Realtime kernel.
Rteval is an attempt to put together a synthetic benchmark which
mimics a well-behaved realtime application running on a heavily
loaded realtime Linux system. It uses the 'cyclictest' program in the
role of the realtime application and uses two loads, a parallel build
of a Linux kernel and the scheduler benchmark 'hackbench', to boost
the system load.
| |
Rteval runs for a specified length of time (typically 12 hours). When
an rteval run is completed, a statistical analysis of the results is
performed and an XML file is generated containing the system state,
raw result data and statistical analysis results.
| |
| The Load Applications |
| --------------------- |
| |
Rteval uses two system loads. The first is a loop running the
scheduler benchmark 'hackbench'. Hackbench creates pairs of threads which
| send data from a sender to a receiver via a pipe or socket. It creates |
| many small processes which do lots of I/O, exercising the kernel |
| scheduler by causing many scheduling decisions. The rteval wrapper |
| class for hackbench continually runs the hackbench program until |
| signaled to stop by the main logic. The number of hackbench threads is |
| determined by the number of cpu cores available on the test system. |
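The shape of that wrapper can be illustrated with a minimal sketch
(this is not rteval's actual module; the hackbench options and the
scaling of groups to cpu count are illustrative assumptions):

    import os
    import subprocess
    import threading

    def hackbench_load(stop_event: threading.Event):
        # Scale the load roughly to the machine size (assumed heuristic).
        groups = max(1, os.cpu_count() // 4)
        cmd = ["hackbench", "-g", str(groups)]
        # Keep re-running hackbench until the main logic signals a stop.
        while not stop_event.is_set():
            subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)

The main rteval logic would start such a function in its own thread
and set the event when the measurement run completes.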
| |
| The other load used by rteval is a parallel compile of the Linux |
| kernel. Rteval has a module named 'kcompile' which controls the kernel |
build process, invoking make with a number of simultaneous build jobs
equal to twice the number of online cpus. The clean, bzImage and modules
| targets are built with an 'allmodconfig' configuration file to |
| maximize the amount of compilation done. This results in a large |
| amount of process creation (preprocessors, compiler, assemblers and |
| linkers) as well as a moderately heavy file I/O load. The kernel build |
| load is repeated until the rteval runtime is reached. |
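A rough sketch of the idea behind the kernel-compile load, assuming an
already unpacked kernel source tree (the real 'kcompile' module is
more involved, handling configuration, logging and cleanup):

    import os
    import subprocess

    def kcompile_load(srcdir, stop_event):
        jobs = 2 * os.cpu_count()           # twice the number of online cpus
        # Configure once with allmodconfig to maximize compilation work.
        subprocess.run(["make", "-C", srcdir, "allmodconfig"], check=True)
        # Repeat full builds until the rteval runtime is reached.
        while not stop_event.is_set():
            for target in ("clean", "bzImage", "modules"):
                subprocess.run(["make", "-C", srcdir, "-j", str(jobs), target],
                               check=True)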
| |
| The intent behind having the load programs is to generate enough |
| threads doing a balanced load of operations (disk I/O, computation, |
| IPC, etc.) so that there is no time in which a processor core in the |
| system under test does not have a process ready to run. The success of |
| the loads can be measured by watching the system 'load average' |
| (either by examining /proc/loadavg or running the 'uptime' or 'top' |
| programs). |
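As a rough check, the 1-minute load average can be compared against
the number of online cpus; the snippet below is only an illustration
of that sanity check, not part of rteval itself:

    import os

    def load_ok():
        with open("/proc/loadavg") as f:
            one_min = float(f.read().split()[0])  # 1-minute load average
        # Every core should have at least one process ready to run.
        return one_min >= os.cpu_count()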
| |
| The Measurement Application |
| --------------------------- |
| |
| The cyclictest program is used as the realtime application. Cyclictest |
| measures the delay between timer expiration and the time when the program |
| waiting for the timer actually runs. The way it does this is by taking |
| a timestamp (t1) just before calling the timer wait function, then |
| sleeping for a specified interval. Upon waking up, a second timestamp |
| (t2) is taken. Then the difference is calculated between the timer |
| expiration time (t1 + interval) and the actual wakeup time (t2). This |
| is the event latency. For example, if the initial time stamp t1 is |
1000 and the interval is 100, then the calculated wakeup time is
| 1100. If the wakeup time stamp (t2) is 1110, then cyclictest would |
| report a latency of 10. |
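Cyclictest itself is written in C and uses clock_nanosleep(2); the
Python sketch below is only a simplified illustration of the
arithmetic, substituting time.sleep for the timer call:

    import time

    INTERVAL = 100e-6     # 100 microseconds, the interval rteval uses

    def one_sample():
        t1 = time.clock_gettime(time.CLOCK_MONOTONIC)  # before sleeping
        time.sleep(INTERVAL)                           # stand-in for clock_nanosleep(2)
        t2 = time.clock_gettime(time.CLOCK_MONOTONIC)  # after waking up
        return t2 - (t1 + INTERVAL)                    # wakeup latency in seconds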
| |
| The cyclictest program is run in one of two modes, with either the |
| --smp option or the --numa option, based on the number of memory nodes |
| detected on the system. Both of these cases create a measurement |
| thread for each online cpu in the system and these threads are run |
| with a SCHED_FIFO scheduling policy at priority 95. All memory |
| allocations done by cyclictest are locked into memory using the |
| mlockall(2) system call (to eliminate major page faults). The |
| measurement threads are run with the same interval (100 microseconds) |
| using the clock_gettime(2) call to get time stamps and the |
| clock_nanosleep(2) call to actually invoke a timer. Cyclictest keeps a |
| histogram of observed latency values for each thread, which is dumped |
| to standard output and read by rteval when the run is complete. |
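A wrapper along the following lines could launch cyclictest and
collect the per-cpu histograms. The exact options rteval passes, the
histogram depth, and the precise output format (whitespace-separated
"bucket-index count-per-cpu" lines, with '#' comment lines) are
assumptions here, not a description of rteval's actual code:

    import os
    import subprocess

    def run_cyclictest(duration_sec, interval_us=100, priority=95, buckets=2000):
        cmd = ["cyclictest", "--smp", "-q", "-m", "-n",
               "-p", str(priority), "-i", str(interval_us),
               "-h", str(buckets), "-D", str(duration_sec)]
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True).stdout

        ncpus = os.cpu_count()
        histograms = [[0] * buckets for _ in range(ncpus)]
        for line in out.splitlines():
            fields = line.split()
            # Skip comments, blank lines and the per-thread summary lines.
            if not fields or not fields[0].isdigit():
                continue
            index = int(fields[0])
            if index >= buckets:
                continue
            for cpu, count in enumerate(fields[1:1 + ncpus]):
                histograms[cpu][index] = int(count)
        return histograms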
| |
| The Results |
| ----------- |
| |
| The main idea behind the 'rteval' program was to get two pieces of |
| information about the performance of the RT Linux kernel on a |
| particular hardware platform: |
| |
| 1. What is the maximum latency seen? |
| 2. How variable are the service times? |
| |
The first is easy: just iterate through the histograms returned for
| each cpu and find the highest index with a non-zero value. The second |
| is a little more complicated. |
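Given per-cpu histograms whose bucket index is the latency in
microseconds, the first question reduces to a few lines (a sketch,
not rteval's actual code):

    def max_latency(histograms):
        """histograms: per-cpu lists, bucket index == latency in usec."""
        worst = 0
        for hist in histograms:
            for latency, count in enumerate(hist):
                if count:                 # highest non-zero bucket wins
                    worst = max(worst, latency)
        return worst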
| |
Early in rteval development, rather than using a histogram, the
cyclictest run would just dump all the samples to a file and rteval
would parse the file after the run. Unfortunately, when you're
sampling at a rate of once every 100 microseconds for each cpu in the
system, you're going to generate a *lot* of data, especially since we
want to run rteval for many hours, possibly days. The output from
cyclictest in that mode is a 26-character string for each sample
recorded. Sampling every 100us yields 10,000 samples per second per
cpu, so for a 1 hour run on a four core box you'd get:
| |
| 10,000 * 60 * 60 * 4 == 144,000,000 samples/hr |
| |
Multiply that by the 26 character string written by cyclictest and
you write 3,744,000,000 bytes (roughly 3.7 GB) per hour to disk. A 12
hour run on a four core system would generate about 44 gigabytes of
data. This was deemed excessive...
| |
| So the decision was made to record the latency values in histogram |
| format, one histogram for each measurement thread. This has the |
| advantage of using only a fixed amount of memory to record samples, |
but has the disadvantage of losing temporal ordering information,
which would have allowed you to detect periodic latencies by looking
at the time stamps of spikes. It also complicates statistics
calculations which presume you have the entire data set available for
analysis. This was worked around by treating each non-zero histogram
bucket as a series of samples with that index value.
| |
| The variability calculation is basically a stock standard deviation |
| calculation, where a mean is calculated for the data set and then |
| variance and standard deviation are calculated. Other measures of |
variability, such as Mean Absolute Deviation, are also calculated,
but to date Standard Deviation has been a reliable indicator of the
variability of service times. This variability is sometimes called
'jitter' in realtime parlance, due to the plot the data would make on
| an oscilloscope. |
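Treating each bucket as 'count' samples of its index value, the usual
statistics can be computed directly from the histogram without ever
expanding it. A sketch of the idea (using the population variance for
simplicity; rteval's statistics module differs in detail):

    import math

    def hist_stats(hist):
        """hist: list where hist[i] is the number of samples at i usec."""
        n = sum(hist)
        mean = sum(i * c for i, c in enumerate(hist)) / n
        variance = sum(c * (i - mean) ** 2 for i, c in enumerate(hist)) / n
        return mean, variance, math.sqrt(variance)  # mean, variance, std dev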
| |
| In addition to calculating mean, variance and standard deviation, |
| rteval's statistics code calculates and stores the usual suspects for |
| statistics, such as min, max, mode, median, and range, both for each |
| cpu core and aggregated for all cores. |
| |
| Another challenge was in identifying the underlying hardware platform |
| so that runs on the same system could be grouped properly. The Desktop |
Management Interface (DMI) tables maintained by the BIOS were a good
| starting point, since they record information about the cpu, memory, |
| and peripheral devices in the system. Added to that information is |
| some state about the system while the test was running: kernel |
| version, active clocksource, number of NUMA nodes available, kernel |
| threads and their priorities, kernel modules loaded, and the state of |
| network interfaces. |
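On Linux much of this information is readily available from sysfs and
procfs; the snippet below sketches the kind of collection involved
(the chosen paths and fields are illustrative, not rteval's exact
code):

    def snapshot_system_state():
        def read(path):
            try:
                with open(path) as f:
                    return f.read().strip()
            except OSError:
                return "unknown"

        return {
            "vendor":      read("/sys/class/dmi/id/sys_vendor"),
            "product":     read("/sys/class/dmi/id/product_name"),
            "kernel":      read("/proc/version"),
            "clocksource": read("/sys/devices/system/clocksource/"
                                "clocksource0/current_clocksource"),
        }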
| |
| Problems |
| -------- |
| |
| Using rteval has helped Red Hat locate areas that cause performance |
| problems with realtime Linux kernels. Some of these problem areas are: |
| |
| 1. BIOS/SMI issues |
Many systems use System Management Interrupts (SMIs) to perform
system-critical operations without help from the running operating
system, by
| trapping back to BIOS code. Unfortunately, this causes 'gaps in time' |
| for the kernel, since nothing can run while an SMI is being handled in |
the BIOS. Most of the time the SMI impact is negligible, since it is
mostly thermal management: reading a thermocouple and turning fans on
or off. Sometimes, though, the operation takes a long time (e.g. when
EDAC logic needs to correct a memory error) and that can cause deadlines to
| be missed by many hundreds of microseconds. Red Hat has been working |
| with hardware vendors to identify these hotspots and reduce their |
| impact. |
| |
| 2. Kernel scalability issues |
| |
| In the past few years, the number of cores per socket on a motherboard |
| has gone up from 2 to 8, resulting in some scalability problems in the |
| kernel. One area that has received a lot of attention is the load |
| balancer for SCHED_OTHER tasks. This is logic in the kernel that |
| attempts to make sure that each core in the system has tasks to run |
| and that no one core is overloaded with tasks. During a load balancer |
| pass, a core with a long run queue (indicating there are many tasks |
| ready on that core) will have some of those tasks migrated to other |
cores, which requires that the run queue locks of both the source and
destination cores be held (meaning nothing can run on those cores).
| |
In a stock Linux kernel, long SCHED_OTHER load balancer passes result
in higher cpu utilization and an overall throughput gain.
Unfortunately, long load balancer passes can also result in missed
deadlines, because a task on the run queue of a core cannot run while
the load balancer is running. To compensate for this, on realtime Linux
| the load balancer has a lower number of target migrations and looks |
| for contention on the run queue locks (meaning that a task is trying |
| to be scheduled on one of the cores on which the balancer is |
| operating). Research in this area is ongoing. |
| |
There is also a load balancer for realtime (SCHED_FIFO and SCHED_RR)
threads, and similar research is being done toward reducing the
overhead of that load balancer as well.
| |
| <what other areas?> |
| |
| 3. NUMA |
| |
The increase in the number of cpu cores per die has been accompanied
by a desire to reduce the number of interconnect traces between cpus
and memory nodes (as you pack the cores tighter, you have less room
to run connections to memory and other cores). One way to do
| this is to route a core's address/data/signal lines to some sort of |
| switch module, such as AMD's HyperTransport (HT) mechanism or Intel's |
| Quick Path Interconnect (QPI). With a series of switches in place many |
| cores can access memory and other cpu resources through the switch |
| network without programs running on them knowing they're going through |
| a switch. This results in a Non-Uniform Memory Access (NUMA) |
architecture, which means that some memory accesses will take longer
than others due to traversing the switch network. NUMA is great for
scaling up throughput-oriented servers, but tends to hurt determinism
if the programs are not aware of the memory topology.
| |
A --numa option was added to the cyclictest program to use the libnuma
library to bind each measurement thread to its local memory node and
allocate memory on the closest memory node, so as to minimize the time
required to access memory.
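Cyclictest does this binding through libnuma (a C library). A rough
Python analogue of just the cpu-binding step, using sysfs and
sched_setaffinity instead of libnuma (the node path and cpulist
parsing are assumptions, and the memory-allocation side of libnuma is
not shown):

    import os

    def bind_to_node(node):
        """Pin the calling process to the cpus of one NUMA node."""
        with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
            cpulist = f.read().strip()      # e.g. "0-3,8-11"
        cpus = set()
        for part in cpulist.split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        os.sched_setaffinity(0, cpus)       # 0 == the calling process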